Reading through the lengthy debate on the lkml titled "silent semantic changes with reiser4" [story] is a time investment. The thread comprises well over 500 emails and is still growing, so I include here a small excerpt of a discussion primarily between Hans Reiser, Andrew Morton [interview], and Linus Torvalds. Questions raised include whether the filesystem should ultimately be merged into the mainline kernel, and if so, how to go about it. Much of the debate concerns extensions that are currently available only through reiser4 and are perhaps not fully compatible with existing utilities. The excerpt begins with some comments by Andrew, who suggests that if the provided feature set is the desired direction for the Linux kernel, his preference would be to "accept the reiser4-only extensions with a view to turning them into kernel-wide extensions at some time in the future, so all filesystems will offer the extensions (as much as possible)".
As quoted earlier [story], Hans stressed that it was important that the reiser4 functionality be merged so that Linux is capable of competing with WinFS and Spotlight. The argument was continued by others, and to these followup comments Linus retorted:
"Hell will freeze over before Microsoft does a filesystem right. Besides, WinFS is likely almost in user mode anyway, ie mostly a library, rather like the gnome people are already doing with gnome storage. So there's really no point in trying to push your agenda by trying to scare people with MS activities. Linux kernel developers do what's right because it is _right_, not because somebody else does it."
As expected, merging Reiser4 into Andrew Morton [interview]'s -mm tree [story] brought with it a lot of additional features and semantic changes. Christoph Hellwig expressed some unhappiness over these semantic changes, spawning a lengthy thread on the lkml. Specifically, he mentioned that the handling of files-as-directories (multiple streams within files) could cause problems for user-space applications, and could also cause dcache problems.
A lot of opposition was expressed. Some mentioned that the handling of multiple streams is really a userspace issue, whereas others mentioned that legacy applications may not properly handle multiple streams, which could lead to the loss of user data. This led Hans Reiser to say in support:
"Andrew, we need to compete with WinFS and Dominic Giampaolo's filesystem for Apple, and that means we need to put search engine and database functionality into the filesystem."
With the release of 18.104.22.168-mm2, in addition to a probable fix for the memory leak some reported when writing audio CDs [story], Andrew Morton [interview] announced that the long awaited [story] Reiser4 filesystem has been merged into his -mm patchset [story]. Hans Reiser provided some information about the history and usage of Reiser4, beginning:
"Reiser4 is a file system based on dancing tree algorithms, and is described at http://www.namesys.com. One should be able to get it up and running just like any of the other filesystems supported by Linux. Configure it to be compiled either builtin or as a module. Create reiser4 filesystem with mkfs.reiser4, mount and use it. More detailed info can be found at http://thebsh.namesys.com/snapshots/LATEST/READ.ME."
In question and answer format - questions from Andrew, answers from Hans - we go on to learn where to obtain the latest Reiser4 tools, and about the current limitations of the filesystem. Hans explains, "Reiser4 has [only] been tested on i386 [...]. Quota support is not ready yet. Should be ready soon. [...] Only the very core functionality is working. Exotic plugins, an API for multiple operation transactions and accessing multiple small files in one syscall, compression, inheritance, all have been postponed until after the core functionality is shipped." Read on for more information, including tips on benchmarking the new filesystem.
An interesting thread on the lkml began when Greg KH submitted a patch for the 2.6 kernel saying, "Ok, to test out the new development model, here's a nice patch that simply removes the devfs code." This was quickly followed with a comment by Oliver Neukum who said, "may I point out that 2.6 is supposed to be a _stable_ series?" In one branch of the thread, the usefulness of devfs was examined.
In another thread, discussion was focused on this "new development model". Jonathan Corbet explained that Linus Torvalds and Andrew Morton [interview] were very happy with the results of their recent teamwork, and saw no immediate pressure to fork a 2.7 development branch. On the contrary, they intend to keep at it as they've been, with things first going into Andrew's -mm patchset [story] for testing, then eventually being merged into the mainline 2.6 kernel. Jonathan went on to explain, "Andrew stated his willingness to consider, for example, four-level page tables, MODULE_PARM removal, API changes, and more. 2.7 will only be created when it becomes clear that there are sufficient patches which are truly disruptive enough to require it. When 2.7 *is* created, it could be highly experimental, and may turn out to be a throwaway tree." And he summarized:
"Andrew's vision, as expressed at the summit, is that the mainline kernel will be the fastest and most feature-rich kernel around, but not, necessarily, the most stable. Final stabilization is to be done by distributors (as happens now, really), but the distributors are expected to merge their patches quickly."
Continuing the earlier discussion about low latency and Ingo Molnar [interview]'s voluntary kernel preemption patch [story], the conversation moved on to the effect a filesystem can have on latency. Specifically, 2.6 maintainer Andrew Morton [interview] noted that ReiserFS was known to have some latency issues in both the 2.4 and 2.6 Linux kernels: "reiserfs: yes, it's a problem. I 'fixed' it multiple times in 2.4, but the fixes ended up breaking the fs in subtle ways and I eventually gave up." However, he did go on to note, "actually, the 2.4 low-latency patch does still have some reiserfs fixes, so it's probably better than reiserfs in 2.6."
When asked if ext3 was a better choice for low latency work, Andrew Morton replied, "ext3 is certainly better than [reiserfs], but still has a couple of potential problem spots. ext2 is probably the best at this time." Data is continuing to be collected and reviewed by a number of kernel developers, so the more noticeable latency issues in the 2.6 kernel will likely be addressed soon.
Those following the evolution of the Reiser4 filesystem will be interested in learning that it has become "fairly stable for average users", so much so that Namesys is soon planning to push patches [story] to 2.6 kernel maintainer Andrew Morton [interview]. Once the two remaining known bugs are fixed, the warnings against using reiser4 on a production system will likely be removed. Hans Reiser explains:
"We have one NFS related bug remaining, and one mmap all of memory related bug (and performance issue) that you can hit using iozone. We will fix both of these in next week's snapshot, they were both multi-day bug fixes. When they are fixed, unless users/distros find bugs next week we will submit it for inclusion in the -mm and then the official kernel."
Hans goes on to note, "We need a lot more real user testers, because we have run out of scripts that can crash it, and there are distros that would like to ship it soon." Read on for the full thread, including links to the latest snapshot and changelog.
It was recently pointed out that the stock 2.6.2 kernel contains in-kernel support for kgdb for some architectures, but not i386. 2.6 maintainer Andrew Morton replied, "lots of architectures have had in-kernel kgdb support for a long time. Just none of the three which I use :(" As to getting kgdb for i386 into the kernel, he explained some reluctance:
"I wouldn't support inclusion of i386 kgdb until it has had a lot of cleanup, possible de-featuritisification and some thought has been applied to splitting it into arch and generic bits. It's quite a lot of work."
It was quickly pointed out that Amit Kale has done much of this work with his version of kgdb, available here. Andrew replied, "Look, there's a lot of interest in this and I of course am fully supportive. If someone could send me Amit's patchset when they think I should test it, I could then talk about it more usefully." Read on for much of the lkml thread, including specific reasons why and why not to include kgdb in the stock 2.6 kernel.
Andrew Morton [interview] has released 2.6.2-rc3-mm1, including a new debug patch to detect when a process calls i_size_write() without holding the inode's i_sem. Andrew explains, "It generates a warning and a stack backtrace. We know that XFS generates such a trace. It will turn itself off after the first ten warnings. Please don't report the XFS case." Also appearing in this kernel is Rusty Russell's CPU hotplug code, recently discussed on the lkml. It is pointed out that 2.6.2-rc3-mm1 is broken on the ppc64 architecture, "something to do with the sched-domains patch although at this stage we do not know whether the problem lies with that patch or with the ppc64 code."
The desire to merge reiser4 [story] into the -mm kernel was again raised. Andrew responded favorably enough, requesting the necessary patches and complete documentation. He does caution, "be aware that the barriers for a new filesystem are relatively high: each one adds a significant maintenance burden to the VFS and MM developers. It will need cautious review." This comment is evidently in reference to 2.6 inclusion, not -mm inclusion, as he goes on to add, "but that doesn't mean we cannot get it out there, get you some more testing and exposure."
A recent posting to the lkml suggested that the udev project has unfairly hijacked the devfs project, leading into yet another lengthy discussion comparing udev to devfs, and questioning why the latter has been deprecated. Linux devfs was written by Richard Gooch and merged into the 2.3.46 kernel in February of 2000. Since that time, Richard has stopped maintaining it, though a number of issues remain. During the 2.5 release cycle others such as Andrey Borzenkov have contributed fixes, though problems evidently remain with the actual design.
As early as 2001, Greg Kroah-Hartman began developing udev, working to implement the same functionality as devfs, but in userspace. Currently at version 010 [story], though not complete, udev is quite functional. For a good understanding of how it works, refer to this pdf from Greg's 2003 OLS talk. During the recent lkml discussion, 2.6 maintainer Andrew Morton acknowledged that though it has "architectural/cleanliness issues [...] devfs shall remain in 2.6 and shall continue to be supported." He went on to explain:
"Nor would I recommend that devfs be removed early from 2.7.x. We should wait until the proposed udev/sysfs solutions have matured in 2.6 and have proven themselves in the field. Only then will we be in a position to confirm that devfs can be removed without causing some people unacceptable levels of grief. There is no rush."
A brief thread on the lkml discussed whether or not Reiser4 would soon be stable enough to be merged into the 2.6 kernel as an 'experimental filesystem'. When it was suggested that this might be overly optimistic, that the filesystem may best go into the 2.7 development kernel [forum] first, Hans Reiser disagreed, "I don't think it is vastly optimistic, I hope we can send something in next month". He went on to explain, "we will have something we think is appropriate for inclusion as an experimental feature very soon now. Because our test scripts have become much more sophisticated, it means more when we say we cannot crash it, and it will go from experimental to stable faster than V3 did. I won't predict how fast."
Jens Axboe, maintainer of the block layer and several CD-ROM drivers, suggested that it would be unwise to merge the code so quickly, instead preferring a much lengthier period of user testing. He explains, "I don't doubt you have great testing scripts, but nothing beats real life testing." During a discussion in late August [story], 2.6 kernel maintainer Andrew Morton [interview] indicated that he would be willing to merge Reiser4 into his -mm patchset [howto].
In a couple of earlier articles, we walked through the process of upgrading to the 2.6.0-test4 kernel [story], and then using a small patch to upgrade to the 2.6.0-test5 kernel [story]. Today we'll continue our patching efforts to upgrade to an even faster-feeling and more stable kernel with Andrew Morton's [interview] -mm patchset [forum].
Andrew Morton began releasing his -mm kernel patches a little over a year ago, in the summer of 2002. The -mm tree began as a 90k patch against the 2.5.17 development kernel, merging in the remote kernel debugger, kgdb. By the release of 2.5.18, the -mm patchset had grown to nearly 238k, merging in a wide assortment of fixes and new functionality. As of this writing, the current -mm patchset is 2.6.0-test5-mm3, weighing in at nearly 5 megabytes. Andrew's -mm tree has evolved from a testing ground for numerous new technologies, to a comprehensive patchset that is usually more stable than the mainline 2.6.0-test kernel itself. This bodes well for the future of the 2.6 kernel, as Andrew Morton will soon be the official 2.6 kernel maintainer.
There are numerous reasons you may want to try Andrew's -mm kernel tree. Stability alone is a good incentive, and scanning the lengthy changelog you'll find a significant number of bug fixes that have been applied. I asked Andrew how the stability of his kernel compares to that of the mainline 2.6.0-test kernel, and he replied that although new bugs occasionally creep in, the -mm tree is generally more stable and up-to-date because it carries the latest fixes.
Mark Wong posted a series of benchmark results from Rusty Russell's Hackbench. Rusty describes Hackbench as a minimized 'chat benchmark' that doesn't use threads or semaphores. The benchmark launches groups of processes that each listen on a given socket, and complementary groups of processes that write 100 messages to each of the listening sockets, measuring the time this takes. This process is repeated multiple times with an increasing number of groups of processes, thereby measuring the scalability of the scheduler with an increasing number of processes.
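To make the shape of such a test concrete, here is a rough Python sketch of the idea described above. It is not Rusty's code: the function names, the group sizes, the 100-byte payload, and the use of socket pairs are all assumptions made for illustration, and it omits details such as a start barrier that lets every process begin at the same moment.

```python
import os
import socket
import time

MSG = b"x" * 100           # payload size is an assumption of this sketch
MESSAGES_PER_SENDER = 100  # "write 100 messages to each of the listening sockets"


def spawn_group(num_receivers, num_senders):
    """Fork one group of receivers and senders; return the child PIDs."""
    pids = []
    write_ends = []
    for _ in range(num_receivers):
        read_end, write_end = socket.socketpair()
        pid = os.fork()
        if pid == 0:
            # Receiver process: drain everything the senders will write.
            write_end.close()
            expected = num_senders * MESSAGES_PER_SENDER * len(MSG)
            got = 0
            while got < expected:
                chunk = read_end.recv(65536)
                if not chunk:
                    break
                got += len(chunk)
            os._exit(0 if got == expected else 1)
        read_end.close()
        write_ends.append(write_end)
        pids.append(pid)
    for _ in range(num_senders):
        pid = os.fork()
        if pid == 0:
            # Sender process: write every message to every receiver.
            for _ in range(MESSAGES_PER_SENDER):
                for sock in write_ends:
                    sock.sendall(MSG)
            os._exit(0)
        pids.append(pid)
    # The parent no longer needs the write ends once the senders exist.
    for sock in write_ends:
        sock.close()
    return pids


def run(num_groups, num_receivers=4, num_senders=4):
    """Run num_groups groups; return (elapsed seconds, failed children)."""
    start = time.perf_counter()
    pids = []
    for _ in range(num_groups):
        pids.extend(spawn_group(num_receivers, num_senders))
    failures = 0
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        if status != 0:
            failures += 1
    return time.perf_counter() - start, failures


if __name__ == "__main__":
    # Repeat with a growing number of groups to see how the time scales.
    for groups in (1, 2, 4):
        elapsed, failures = run(groups)
        print(f"{groups} group(s): {elapsed:.3f}s, {failures} failed")
```

Because every sender and receiver is a separate process that spends its time blocking on and waking from socket I/O, the elapsed time for a growing number of groups mostly reflects how well the scheduler copes with many runnable processes.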
Mark's results begin with the 2.5.28 development kernel and continue up through the current 2.6.0-test5 kernel. In a second email he also offers results of the -mm tree, beginning with 2.5.66-mm1 and continuing up through 2.6.0-test5-mm2. Andrew Morton [interview] glanced at the results and commented that they looked "great, but tragically incomprehensible", going on to ask for an explanation, "do we rock or do we suck?". Mark replied, "the general trend in the metric indicates everything has been improving, so I think we rock."
Following Andrew Morton's [interview] recent posting of 2.6.0-test4-mm2 [forum], Christian Axelsson asked, "Is there any work [being] done on getting reiser4 into mm? I havent tried it myself yet but I've heard of colliding code in [the] scheduler". Andrew replied that a merging effort hasn't been made, but that he'd be interested in making it happen in a month or two so long as the namesys developers were willing to commit to providing him with up-to-date patches. Hans Reiser offered:
"We would be happy to make that commitment, and happy to switch from creating snapshots every week to pushing to you and linking to you from our website. Several people have asked for this besides Christian."
In other words, it looks like -mm users will soon have easy access to the reiser4 filesystem.
Grant Miner posted some interesting benchmark results to the lkml, comparing five journaling filesystems available with the current 2.6.0-test2 development kernel. The tests were conducted with a very simple shell script, mainly timing how long it takes to copy, tar, and remove directories, performing several syncs in between. He summarizes:
- ext3's syncs tended to take the longest [at] 10 seconds, except
- JFS took a whopping 38.18s on its final sync
- xfs used more CPU than ext3 but was slower than ext3
- reiser4 had highest throughput and most CPU usage
- jfs had lowest throughput and least CPU usage
Some interesting discussion follows, debating the results and offering further suggestions on making the tests more useful. For example, Andrew Morton [interview] proposed including ext2 in the tests as a baseline, and Hans Reiser noted that reiser4 continues to improve rapidly. Read on for the full test results and much of the following discussion.
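For readers who want to try something similar, the following is a minimal Python sketch of this style of test: time a copy, a tar, and a removal, with a sync after each step. It is not Grant's script (his was a shell script); the function names, the step order, and the directory layout are assumptions made for illustration.

```python
import os
import shutil
import tarfile
import tempfile
import time


def timed(label, func, results):
    """Run func, flush dirty data to disk, and record the elapsed time."""
    start = time.perf_counter()
    func()
    os.sync()  # count the cost of writing the data out, not just caching it
    results[label] = time.perf_counter() - start


def make_tar(src, tar_path):
    """Archive the directory tree at src into an uncompressed tar file."""
    with tarfile.open(tar_path, "w") as tar:
        tar.add(src, arcname="tree")


def run_benchmark(source_dir, work_dir):
    """Copy, tar, and remove a tree inside work_dir; return timings in seconds."""
    results = {}
    copy_dir = os.path.join(work_dir, "copy")
    tar_path = os.path.join(work_dir, "copy.tar")
    timed("copy", lambda: shutil.copytree(source_dir, copy_dir), results)
    timed("tar", lambda: make_tar(copy_dir, tar_path), results)
    timed("remove", lambda: shutil.rmtree(copy_dir), results)
    return results


if __name__ == "__main__":
    # Hypothetical stand-in data; a real run would use a large source tree
    # and a work_dir on the filesystem under test.
    src = tempfile.mkdtemp()
    work = tempfile.mkdtemp()
    for i in range(100):
        with open(os.path.join(src, f"file{i}.txt"), "w") as f:
            f.write("data\n" * 1000)
    for step, seconds in run_benchmark(src, work).items():
        print(f"{step}: {seconds:.3f}s")
```

To compare filesystems, `work_dir` would point at a mount of each filesystem in turn, with the same source tree used every time. Wall-clock time alone does not show the CPU usage that Grant's summary also reports.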
Andrew Morton [interview] posted on the lkml, "In 2.4.20-pre5 an optimisation was made to the ext3 fsync function which can very easily cause file data corruption at unmount time". This bug only affects people using ext3 in the uncommon "data=journal" mode, or files operating under "chattr +j", and does not affect the 2.5 series of kernels.
Andrew went on to say that "The symptoms are that any file data which was written within the thirty seconds prior to the unmount may not make it to disk. A workaround is to run `sync' before unmounting". He also posted a patch to fix the problem. However, soon thereafter, he posted saying that "that 'fix' didn't fix it. Sorry about that". Until a proper fix can be developed, he recommends that people "please avoid ext3/data=journal". Since "data=journal" is not the default ext3 mode, it is unlikely most people running ext3 will be affected by this. However, it is a data corruption bug so you should double-check that you use either "data=ordered" or "data=writeback" as your ext3 mode of operation.
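One way to double-check is to look at the mount options of each ext3 filesystem. The sketch below is only an illustration: the function names and the sample data are hypothetical, and it assumes input in the usual four-leading-field layout (device, mount point, type, options) used by /etc/fstab and /proc/mounts. Whether the kernel actually lists a "data=" option in /proc/mounts varies, so an entry reported as "unspecified" simply means no explicit mode was found.

```python
# Hypothetical sample input in /etc/fstab or /proc/mounts layout.
SAMPLE_MOUNTS = """\
/dev/hda1 / ext3 rw,data=journal 0 0
/dev/hda2 /home ext3 rw,data=ordered 0 0
/dev/hda3 /var ext3 rw 0 0
/dev/hda4 /tmp ext2 rw 0 0
"""


def ext3_data_modes(mounts_text):
    """Map each ext3 mount point to its data= mode, or 'unspecified'."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, short lines, and non-ext3 entries.
        if len(fields) < 4 or fields[0].startswith("#") or fields[2] != "ext3":
            continue
        mount_point = fields[1]
        mode = "unspecified"
        for option in fields[3].split(","):
            if option.startswith("data="):
                mode = option[len("data="):]
        modes[mount_point] = mode
    return modes


def at_risk(modes):
    """Return the mount points explicitly using data=journal."""
    return sorted(mp for mp, mode in modes.items() if mode == "journal")


if __name__ == "__main__":
    modes = ext3_data_modes(SAMPLE_MOUNTS)
    for mount_point, mode in sorted(modes.items()):
        print(f"{mount_point}: data={mode}")
    print("explicit data=journal mounts:", at_risk(modes))
```

On a real system the text would be read from /etc/fstab or /proc/mounts instead of the sample string. Note that this only inspects mount options; it does not detect individual files that have data journaling enabled through chattr.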