"HAMMER makes no modifications to the B-Tree whatsoever on the front-end. When you create, delete, rename, write, etc... when you do those operations HAMMER caches them in a virtualization layer in memory and doesn't make any modifications to its on-media data structures (or their in-memory representations) at all until the meta-data is synced to disk," DragonFly BSD creator Matthew Dillon explained, comparing HAMMER, his clustering filesystem, to a wiki summary of Reiser4's implementations. He continued:
"HAMMER uses a modified B+Tree for its on-disk representation, which is a B-Tree with only keys at internal nodes and only records at the leafs. This was done to reduce structural bloat, allow for a leaf->leaf linking optimization in the future, and for other reasons. [...] HAMMER's internal nodes have a left and right bounding element. A standard B+Tree only has a left bounding element. By adding a right bounding element HAMMER can cache pointers into its B+Tree and 'pick up' searches, insertions, and deletions relative to the cached pointers instead of having to start at the root of the tree. More importantly, it can pickup searches, insertions, and deletions at internal nodes, not just leaf nodes. So I can cache a proximity pointer and if I do a good job I never have to traverse the B+Tree above that point."
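The bounding-element idea described above can be illustrated with a small sketch. This is not HAMMER's code (HAMMER is written in C and its real structures differ); the `Node` class, the `covers` helper, and the `search` function are hypothetical names invented for the example. It assumes each node stores an inclusive left bound and an exclusive right bound, so a cached node can decide on its own whether a key lies inside its subtree and a search can resume there instead of at the root.

```python
from bisect import bisect_left, bisect_right


class Node:
    """Hypothetical B+Tree node that carries both a left and a right bound."""

    def __init__(self, left, right, keys, children=None, records=None):
        self.left = left          # inclusive lower bound of keys under this node
        self.right = right        # exclusive upper bound of keys under this node
        self.keys = keys          # separator keys (internal) or record keys (leaf)
        self.children = children  # child nodes; None for a leaf
        self.records = records    # records parallel to keys; None for internal nodes

    def covers(self, key):
        # Only possible because the node knows its right bound as well as its left.
        return self.left <= key < self.right


def search(root, key, cached=None):
    """Return (record, leaf, nodes_visited), resuming at `cached` when it covers key."""
    node = cached if cached is not None and cached.covers(key) else root
    visited = 1
    while node.children is not None:
        node = node.children[bisect_right(node.keys, key)]
        visited += 1
    i = bisect_left(node.keys, key)
    found = i < len(node.keys) and node.keys[i] == key
    return (node.records[i] if found else None), node, visited


# A tiny three-level tree: keys only at internal nodes, records only at leaves.
leaf_a = Node(0, 10, keys=[1, 5], records=["a", "b"])
leaf_b = Node(10, 20, keys=[10, 15], records=["c", "d"])
leaf_c = Node(20, 30, keys=[22], records=["e"])
leaf_d = Node(30, 40, keys=[35], records=["f"])
inner1 = Node(0, 20, keys=[10], children=[leaf_a, leaf_b])
inner2 = Node(20, 40, keys=[30], children=[leaf_c, leaf_d])
root = Node(0, 40, keys=[20], children=[inner1, inner2])

from_root = search(root, 15)                    # walks root, inner1, leaf_b
from_cache = search(root, 5, cached=inner1)     # starts at the cached internal node
fallback = search(root, 35, cached=inner1)      # cache does not cover 35, restart at root
```

In this sketch the cached pointer is an internal node, which matches the point in the quote that searches can be picked up at internal nodes and not just at leaves. With only a left bound, `inner1` could not rule out key 35 without consulting its parent.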
Jaroslav Sykora posted a series of five patches to handle the kernel portion of what he described as "shadow directories", providing an example which utilized FUSE to access the contents of a compressed file from the command line. His first example was cat hello.zip^/hello.c about which he explained, "the '^' is an escape character and it tells the computer to treat the file as a directory. The kernel patch implements only a redirection of the request to another directory('shadow directory') where a FUSE server must be mounted. The decompression of archives is entirely handled in the user space. More info can be found in the documentation patch in the series."
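As a rough user-space illustration of the redirection described above, the sketch below rewrites a path containing the '^' escape into a path under a separate directory. It is not the kernel patch: the `redirect` function and the `/shadow` mount point are assumptions made up for the example, and it only handles absolute paths.

```python
import posixpath

SHADOW_ROOT = "/shadow"  # hypothetical directory where the FUSE server is mounted


def redirect(path, escape="^"):
    """Map '/dir/archive^/inner' to a path under SHADOW_ROOT; leave other paths alone."""
    parts = path.split("/")
    for i, part in enumerate(parts):
        # A component ending in the escape character is treated as a directory.
        if part.endswith(escape) and len(part) > len(escape):
            archive = "/".join(parts[:i] + [part[:-len(escape)]])
            inner = "/".join(parts[i + 1:])
            return posixpath.join(SHADOW_ROOT, archive.lstrip("/"), inner)
    return path


inside_archive = redirect("/home/user/hello.zip^/hello.c")
plain_file = redirect("/home/user/hello.c")
```

Because '^' is also a legal character in ordinary filenames, a file literally named `notes^` would be rewritten by this sketch as well, which is the kind of ambiguity raised in the replies to the patch series.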
Several problems were raised. Jan Engelhardt noted, "too bad, since ^ is a valid character in a *file*name. Everything is, with the exception of '\0' and '/'. At the end of the day, there are no control characters you could use." Later in the thread an lwn.net article from a couple of years earlier was quoted, "another branch, led by Al Viro, worries about the locking considerations of this whole scheme. Linux, like most Unix systems, has never allowed hard links to directories for a number of reasons;" The article had been discussing Reiser4, which treats files as directories. In the current discussion, Al Viro added, "as for the posted patch, AFAICS it's FUBAR in handling of .. in such directories. Moreover, how are you going to keep that shadow tree in sync with the main one if somebody starts doing renames in the latter? Or mount --move, or..."
The future of Reiser4 was raised on the lkml, with the filesystem's creator, Hans Reiser [interview], awaiting his May 7th trial [story]. Concerns that the filesystem wasn't being maintained were laid to rest when Andrew Morton [interview] stated, "the namesys engineers continue to maintain reiser4 and I continue to receive patches for it." He further added, "the namesys guys are responsive and play well with others." As to why the filesystem hasn't yet been merged into the 2.6 kernel, Andrew explained, "to get it unstuck we'd need a general push, get people looking at and testing the code, get the vendors to have a serious think about it, etc. We could do that - it'd require that the namesys people (and I) start making threatening noises about merging it, I guess." He then made joking reference to the recent debate regarding the new CPU schedulers [story], "or we could move all the reiser4 code into kernel/sched.c - that seems to get people fired up."
Namesys developer and author of the Reiser4 encryption and compression plugins, Edward Shishkin, offered some updates. Replying to some comments about the need to remove plugins from the Reiser4 code he explained, "the popular opinion that plugins make more sense in the VFS is a great delusion, as plugins are entities related to reiser4 disk layouts." In an earlier thread it had been suggested that the plugins were misnamed and would be better called an internal abstraction layer [story]. Edward went on to note, "currently there are two namesys employees working [on Reiser4] mostly on enthusiasm." He linked to a wiki page listing known issues with the code needing to be fixed before it's likely to be merged into the 2.6 kernel, "the main issues here are xattrs and support for blocksize != pagesize. I think that adding xattrs will take ~1 month of full-time working. Not sure about blocksize support." When it was noted that other filesystems have already been merged without support for either of these features, Edward said that they'd lower their priority and finish up with the other remaining issues left on the old todo list and resume the merge discussion at that time.
With Namesys founder Hans Reiser [interview] recently arrested as the prime suspect in the disappearance of his estranged wife, a brief thread on the lkml discussed the future of ReiserFS. Alan Cox [interview] pointed out that, "reiserfs is written by a team of people at Namesys, and particularly with reiserfs3 people at SuSE and elsewhere as well."
Alexander Lyamin, listed on the Namesys website as their "hostmaster and sysadmin", noted that the team was "rather shaken and stressed at the moment". He confirmed that ReiserFS 3.6 is currently in maintenance mode, then continued to discuss Reiser4, "we are still going through revisions, thanks to [Andrew Morton]. Chunking out patches, fixing issues and generally cleaning the house." He explained that this was the short term plan, for at least the next 6 months. Regarding the future he noted it depends on the outcome of the trial, "if it goes [the] way we hope it will go. Well... We will do fine. If it goes bad. That is where it becomes tricky. We will try to appoint a proxy to run Namesys business."
Andrew Morton [interview] posted his patch queue with numerous comments about merge plans into the mainline kernel. Among his comments he noted that he would not yet be merging the Reiser4 filesystem [story], "reiser4. I was planning on merging this, but the batch_write/writev problem might wreck things, and I don't think the patches arising from my recent partial review have come through yet. So it's looking more like 2.6.20."
A large discussion followed Andrew's posting that focused on the current kernel development process [story]. Andrew expressed his concerns on what's currently happening, "people seem to treat the stabilisation period as a wonderful quiet time in which to run off and develop new features, rather than participating in the stabilisation. This has the following effects: 1: release cycles get longer 2: the kernel has more bugs 3: we put new features into the kernel faster than we otherwise would (see 2:, above)." Alan Cox [interview] proposed an idea, "a suggestion from the department of evil ideas: Call even cycles development odd ones stabilizing. Nothing gets into an odd one without a review and linux-kernel signoff/ack?" Linus Torvalds replied favorably, going on to note that he was surprised at how well the decision to only accept big merges in the two weeks following a major release has been accepted, "I actually expected people to dislike arbitrary rules more than they do, but I've come to believe that people _like_ having rules that they have to obey, as long as it's not a big pain for them. In other words, arbitrary rules are not actually disliked at all, people actually _like_ them, because suddenly there's less need for making unnecessary judgement decisions." Linus went on to spell out the idea further, "2.6.<odd> is 'the big initial merges with all the obvious fixes to make it all work' (ie roughly the current -rc2 or perhaps -rc3). 2.6.<even> is 'no big merges, just careful fixes' (ie the current 'real release')". He went on to caution:
"That said, I think Andrew was of the opinion that it doesn't really _fix_ anything, and he may well be right. What's the point of the odd release, if the weekly snapshots after that are supposed to be strictly better than it anyway? So I think I may like it just because it _seems_ to combine the good features of both the old naming scheme and the current one, but I suspect Andrew may be right in that it doesn't _really_ change anything, deep down."
The ongoing discussion about the Reiser4 filesystem [story] continues on the lkml. Jeff Garzik discussed the complexity introduced by a plugin layer [story], suggesting it is really a second VFS, "furthermore, it completely changes the notion of what a Linux filesystem is. Currently, each Linux filesystem is a tightly constrained set of metadata support. reiser4 changes 'tightly constrained' to 'infinity'. While that freedom is certainly liberating, it also has obvious support costs due to new admin paradigms and customer configuration possibilities."
Linux creator Linus Torvalds weighed in on the discussion, "as long you call them 'plugins' and treat them as such, I (and I suspect a lot of other people) are totally uninterested, and in fact, a lot of people will suspect that the primary aim is to either subvert the kernel copyright rules, or at best to create a mess of incompatible semantics with no sane overlying rules for locking etc." He went on to add, "as far as I'm concerned, the problem with reiser4 is that it hasn't tried to work with the VFS people. Now, I realize that the main VFS people aren't always easy to work with (Al and Christoph, take a bow), but that doesn't really change the basic facts. Al in particular is _always_ right. I don't think I've ever had the cojones to argue with Al.."
Later in the same thread, Andrew Morton [interview] noted that he's currently reviewing the code, "meanwhile here's poor old me trying to find another four hours to finish reviewing the thing." Regarding the code he added, "the writeout code is ugly, although that's largely due to a mismatch between what reiser4 wants to do and what the VFS/MM expects it to do. If it works, we can live with it, although perhaps the VFS could be made smarter." He then suggested, "I'd say that resier4's major problem is the lack of xattrs, acls and direct-io. That's likely to significantly limit its vendor uptake." As for the plugin debate, Andrew said, "the plugins appear to be wildly misnamed - they're just an internal abstraction layer which permits later feature additions to be added in a clean and safe manner. Certainly not worth all this fuss."
The discussion about why the Reiser4 filesystem has not been merged into the Linux kernel [story] continues on the lkml. Hans Reiser [interview] contrasted the struggles Reiser4 has had trying to get merged versus recent discussion about the up and coming ext4 filesystem [story], "the code isn't even written, benchmarked, or tested yet, and it is going into the kernel already so that its developers don't have to deal with maintaining patches separate from the tree. Wow. Kind of hard to argue that it is not politically differentiated, isn't it?"
Theodore Ts'o responded, "it is a development procedure that was developed after discussion and consensus building across LKML and the ext2/3/4 development team. It was not the original plan put forth by the ext2 developers, but after listening to the concerns and suggestions, we did not question the motives of the people making suggestions; we listened." He went on to note that parts of what will be ext4 were written a year ago, and have been heavily tested and reviewed. Others pointed out that the evolution between ext3 and ext4 will be a very public process, with patches being merged gradually, whereas Reiser4 is a completely different code base from Reiser3.
The latest chapter in this ongoing debate tends to be more about clashing personalities than the code in question. How this affects if and when the Reiser4 filesystem will be merged into the mainline Linux kernel is yet to be seen.
The question of if and when Reiser4 will be merged into the mainline Linux kernel has been an ongoing debate for a couple of years [story]. The filesystem was described as being "fairly stable for average users" by Hans Reiser [interview] over two years ago, in March of 2004 [story]. It has been merged into Andrew Morton [interview]'s -mm kernel [story], though issues such as Reiser4 plugins [story] and coding style [story] caused lengthy discussions last year. Two recent threads on the lkml raised the question again, asking at a non-technical level why Reiser4 has not been included in the Linux kernel. Some have offered theories that Reiser4 is being blocked for political reasons, others because of concerns that once Reiser4 is included Namesys might forget it and move on to another filesystem. Responses to these theories point out that in reality there are technical issues that must be resolved before the filesystem will be merged, and that much progress has been made toward this end. Additional discussion can be found on a relevant recently created kernel newbies wiki page.
Hans Reiser posted a "short term task list for Reiser4" to address the remaining technical issues. The todo list included getting batch_write merged into the -mm kernel [story], getting read optimization code merged into the -mm kernel, documenting everything in the Namesys wiki, exploring and addressing reports of system pauses when using Reiser4, a complete review of the crypt-compress code, a large effort in optimizing fsync, a review of installation instructions, and a review of the kernel documentation. Hans explains, "unfortunately, our code stability is going to decrease for a bit due to all these changes to the read and write code --- no way to cure that but passage of time. On the other hand, our CPU usage went way down. Reiser4's only performance weakness now is fsync. Once the crypt-compress code is ready, we will release Reiser4.1-beta (with plugins, releasing a beta means telling users that if they mount -o reiser4.1-beta then cryptcompress will be their default plugin, and if they don't, then they are using Reiser4.0 still). Doubling our performance and halving our disk usage is going to be fun."
Hans Reiser [interview] described a recently posted patch as, "it revises the existing reiser4 code to do a good job for writes that are larger than 4k at a time by assiduously adhering to the principle that things that need to be done once per write should be done once per write, not once per 4k." He went on to explain, "this code empirically proves that the generic code design which passes 4k at a time to the underlying FS can be improved. Performance results show that the new code consumes 40% less CPU when doing 'dd bs=1MB .....'" Referring to generic_file_write(), he further noted that currently when writing 64MB of data, "it may go to the kernel as a 64MB write, but VFS sends it to the FS as 64MB/4k separate 4k writes." It was acknowledged that this could also be accomplished in a non-generic way, however earlier feedback had suggested that such improvements should be made available to all.
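To make the "64MB/4k separate 4k writes" arithmetic concrete, here is a toy model that counts how often a filesystem's write hook is entered under each scheme. It is only an illustration and not kernel code: `CountingFS`, `write_per_page`, and `write_batched` are hypothetical names, and it assumes 64MB means 64 * 2^20 bytes and 4k means 4096 bytes.

```python
PAGE = 4 * 1024  # the 4k unit handed to the filesystem in the per-page scheme


class CountingFS:
    """Toy stand-in for a filesystem that counts entries into its write hook."""

    def __init__(self):
        self.calls = 0
        self.bytes_written = 0

    def write(self, offset, chunk):
        # Any per-write setup cost would be paid once for each call counted here.
        self.calls += 1
        self.bytes_written += len(chunk)


def write_per_page(fs, data):
    """Hand the filesystem one 4k chunk at a time."""
    for offset in range(0, len(data), PAGE):
        fs.write(offset, data[offset:offset + PAGE])


def write_batched(fs, data):
    """Hand the filesystem the whole request in a single call."""
    fs.write(0, data)


data = bytes(64 * 1024 * 1024)  # one 64MB write request

per_page = CountingFS()
write_per_page(per_page, data)

batched = CountingFS()
write_batched(batched, data)
```

Both schemes move the same number of bytes, but under these assumptions the per-page scheme enters the filesystem 16384 times where the batched scheme enters it once, which is where work that only needs doing "once per write" gets repeated.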
Andrew Morton [interview] responded to the proposed changes saying, "there's nothing which leaps out and says 'wrong' in this. But there's nothing which leaps out and says 'right', either. It seems somewhat arbitrary, that's all." He pointed out that reiser4 was currently the only filesystem to benefit from the changes, "to be able to say 'yes, we want this' I think we'd need to understand which other filesystems would benefit from exploiting it, and with what results?" In the resulting discussion, it was determined that both FUSE [story] and XFS [story] would benefit from these changes prompting Hans to ask, "Is it enough?" Andrew agreed, "Spose so. Let's see what the diff looks like?"
Hans Reiser [interview] sent an email to the lkml titled, "I request inclusion of reiser4 in the mainline kernel". He provided a list of objections raised earlier, noting that all had been addressed. Among the listed issues, Reiser4 now works with 4k stacks. "There have been no bug reports concerning the new code," Hans added.
The request was followed with some suggestions by Christoph Hellwig, including general comments about the coding style. This was one of many issues that led to debate in which Hans commented, "most of my customers remark that Namesys code is head and shoulders above the rest of the kernel code. So yes, it is different." Alan Cox [interview] replied that while the kernel coding style isn't his own style, he tries to follow it when working on the kernel, "one big reason we jump up and down so much about the coding style is that its the one thing that ensures someone else can maintain and fix code that the author has abandoned, doesn't have time to fix or that needs access to specific hardware the authors may not have." Much of the rest of the thread was less friendly, leaving the question of merging Reiser4 into the mainline kernel still up in the air.
Hans Reiser formed Namesys and began the development of Reiserfs ten years ago. The first release of the filesystem, Reiser3, is part of the mainline 2.4 and 2.6 Linux kernels. The more recent Reiser4 is a complete redesign and reimplementation of Reiserfs, aiming to soon be merged into the mainline 2.6 Linux kernel.
In this interview, Hans discusses his background and how he came to create Namesys and Reiserfs. He looks back at Reiser3, describing the advantages it had over other filesystems when it was released and its current state. He then explores the many improvements currently in Reiser4, describing the plugin architecture and its exciting potential for future semantic enhancements.
Andrew Morton [interview] provided an update on the current development status of the Linux kernel. As of his announcement, the latest development release is 2.6.13-git5, with 2.6.14 expected around October 7th. At this time, Andrew is tracking 144 bugs though he notes, "I haven't culled these yet - some may be fixed." Indeed, a number of replies indicated that several of the listed bugs have been fixed.
As for what will likely be merged in the next couple of weeks and be part of the upcoming 2.6.14 release, Andrew listed several filesystems including relayfs [story], v9fs [story], and FUSE [story]. Regarding the latter he noted that he was, "fed up with arguing - any remaining problems can be fixed up in-tree if anyone can think of how to fix them." As for much anticipated Reiser4, Andrew summarized, "Stuck. Last time we discussed this I asked the reiser4 team to develop and negotiate a bullet-point list of things to be addressed. Once that's agreed to, implement it and then we can merge it. None of that has happened and as far as I know, all the review feedback which was provided was lost."
In the debate following Andrew Morton [interview] posting his plans for 2.6.13 [story], the existence of a plugin layer in Reiser4 was discussed. Jeff Garzik put it bluntly, "the plugin stuff is crap. This is not a filesystem but a filesystem new layer. IMO considered in that light, it duplicates functionality elsewhere." Andrew Morton went on to explain, "I think the concern here is that this is implemented at the wrong level. In Linux, a filesystem is some dumb thing which implements address_space_operations, filesystem_operations, etc."
Hans Reiser noted, "please remember that this is per file, per item, per node, per attribute, per disk format, per bitmap, per super block, etc., abstracting, not per filesystem abstracting." He explained that plugins offer a couple of advantages: they make it much easier for developers to change the disk format, and they allow for easy code reuse. He added, "the use of plugins forced all the programmers to think about reusability at every layer of design. V3 of reiserfs is way too hard to work on and modify. If you ask one of the team to code something for V3 instead of V4, they quietly groan at the thought. It is just so much easier to do in V4."
Andrew Morton replied, "advanced features such as those which you describe are implemented on top of the filesystem, not within it. reiser4 turns it all upside down. Now, some of the features which you envision are not amenable to above-the-fs implementations. But some will be, and that's where we should implement those." The lengthy discussion continued, an interesting read for Reiser4 supporters and detractors alike.
With the release of 2.6.9-mm1, Andrew Morton [interview] offered a quick status update on a number of patches in his -mm tree [forum] that are 2.6-mainline hopefuls. For example, regarding the much debated reiser4 filesystem [story], Andrew said that he is still "not sure, really. The namespace extensions were disabled, although all the code for that is still present. Linus's filesystem criterion used to be 'once lots of people are using it, preferably when vendors are shipping it'. That's a bit of a chicken and egg thing though. Needs more discussion". And as for Ingo Molnar [interview]'s preemption and low-latency fixups [forum] Andrew offered, "I haven't really thought about it and haven't looked at the patches yet. Hopefully 2.6.10 material."
Other projects specifically mentioned include the sysfs backing store, the ext3 reservations code, the ext3 resize code, kexec and crashdump [story], perfctr, cachefs, cpusets, and the md updates. Read on for Andrew's comments and the complete -mm1 changelog.
Continuing the debate over the right way to go about implementing some of the features found in the newly released Reiser4 [story], Hans Reiser asked Al Viro to clarify the issues he thinks could arise from the current implementation. The result was a brilliant explanation of what problems Al sees, specifically related to dentry aliasing, and how the current VFS architecture handles some of these problems.
Read on for Al's response and further clarification from Linus Torvalds. The interesting exchange provides some good insight into the Linux VFS layer.