"I've never looked at the Reiser code though the comments I get from friends who use it are on the order of 'extremely reliable but not the fastest filesystem in the world'," Matt Dillon explained when asked to compare his new clustering HAMMER filesystem with ReiserFS, both of which utilize BTrees to organize objects and records. He continued, "I don't expect HAMMER to be slow. A B-Tree typically uses a fairly small radix in the 8-64 range (HAMMER uses 8 for now). A standard indirect block methodology typically uses a much larger radix, such as 512, but is only able to organize information in a very restricted, linear way." He continued to describe numerous plans he has for optimizing performance, "my expectation is that this will lead to a fairly fast filesystem. We will know in about a month :-)"
Among the optimizations planned, Matt explained, "the main thing you want to do is to issue large I/Os which cover multiple B-Tree nodes and then arrange the physical layout of the B-Tree such that a linear I/O will cover the most likely path(s), thus reducing the actual number of physical I/O's needed." He noted, "HAMMER will also be able to issue 100% asynchronous I/Os for all B-Tree operations, because it doesn't need an intact B-Tree for recovery of the filesystem." He went on to describe another potential optimization allowed by the filesystem's design, "HAMMER is designed to allow cluster-by-cluster reoptimization of the storage layout. Anything that isn't optimally laid out at the time it was created can be re-laid out at some later time, e.g. with a continuously running background process or a nightly cron job or something of that ilk. This will allow HAMMER to choose to use an expedient layout instead of an optimal one in its critical path and then 'fix' the layout later on to make re-accesses optimal."
From: Chris Turner <c.turner@...>
Subject: Re: HAMMER filesystem update - design document
Date: Oct 12, 6:06 pm 2007
Matthew Dillon wrote:
>
> It will be cluster-by-cluster to begin with. I don't expect it to cause
> any issues, the BxTree in each cluster will be fairly compact and well
> cached and, most importantly, nearly all write I/O can be asynchronous
> so locks simply will not be held all that long.
>
> Eventually it will be possible to use inherent buffer cache locks to
> lock the BxTree operations but it's a little dicey to try to do
> that level of fine-grained locking by default due to the allocation
> model.
>
Anyone up on ReiserFS ?
(but still capable of a 'clean room' description :)
As I recall, according to their docs it seems to have been one of the
first to use BTrees in the general sense for internal structuring ..
also as I recall, there were some performance problems in specific areas
of requiring extra CPU for basic IO (due to having to compute tree
operations rather than do simple pointer manipulations) and also
concurrent IO (due to the need for complex tree locking types of things,
possibly compounded by the extra cpu time)
this kind of a thing is more for replication than 100% raw speed, but
in any case.. just some topics for discussion I suppose ..
I still need to read the more detailed commits.
looking forward to it.
- Chris
From: Matthew Dillon <dillon@...>
Subject: Re: HAMMER filesystem update - design document
Date: Oct 13, 8:59 pm 2007
:Anyone up on ReiserFS ?
:
:(but still capable of a 'clean room' description :)
:...
:As I recall, according to their docs it seems to have been one of the
:first to use BTrees in the general sense for internal structuring ..
:
:also as I recall, there were some performance problems in specific areas
:of requiring extra CPU for basic IO (due to having to compute tree
:operations rather than do simple pointer manipulations) and also
:concurrent IO (due to the need for complex tree locking types of things,
:possibly compounded by the extra cpu time)
:
:this kind of a thing is more for replication than 100% raw speed, but
:in any case.. just some topics for discussion I suppose ..
:
:I still need to read the more detailed commits.
:
:looking forward to it.
:
:- Chris
I've never looked at the Reiser code though the comments I get from
friends who use it are on the order of 'extremely reliable but not
the fastest filesystem in the world'.
I don't expect HAMMER to be slow. A B-Tree typically uses a fairly
small radix in the 8-64 range (HAMMER uses 8 for now). A standard
indirect block methodology typically uses a much larger radix, such
as 512, but is only able to organize information in a very restricted,
linear way.
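To make the radix trade-off concrete, here is a rough sketch (an illustration, not part of the original message) of how many tree levels are needed to reach a given number of records at different radix values. The function name `levels_needed` and the figure of one million records are assumptions chosen for the example, and it assumes every node is completely full, which a real B-Tree rarely achieves.

```python
def levels_needed(num_records: int, radix: int) -> int:
    """Minimum number of levels for a tree whose nodes each hold
    `radix` children, assuming every node is completely full."""
    if num_records < 1 or radix < 2:
        raise ValueError("need at least one record and a radix of 2 or more")
    levels = 1
    capacity = radix  # records reachable through a single level
    while capacity < num_records:
        capacity *= radix
        levels += 1
    return levels


if __name__ == "__main__":
    # Radix 8 is the value HAMMER uses "for now"; 64 and 512 are the
    # other figures mentioned in the message.
    for radix in (8, 64, 512):
        print(radix, levels_needed(1_000_000, radix))
```

Under these assumptions a radix of 8 needs 7 levels for a million records, against 4 levels for a radix of 64 and 3 for a radix of 512, which is why the layout and caching of the nodes matters so much for a small-radix tree.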
There are several tricks to making a B-Tree operate efficiently, but
the main thing you want to do is to issue large I/Os which cover
multiple B-Tree nodes and then arrange the physical layout of the B-Tree
such that a linear I/O will cover the most likely path(s), thus
reducing the actual number of physical I/O's needed. Locality of
reference is important.
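The effect of layout on I/O count can be sketched roughly as follows (an illustration, not HAMMER code). The helper `physical_ios`, the 4 KB node size, and the 16 KB I/O size are all assumptions; the sketch also assumes that each node fits entirely inside one aligned I/O window.

```python
def physical_ios(node_offsets, io_size):
    """Count the aligned windows of `io_size` bytes that must be read
    to touch every node at the given byte offsets.

    Assumes each node lies entirely within a single window."""
    return len({offset // io_size for offset in node_offsets})


if __name__ == "__main__":
    io_size = 16 * 1024  # hypothetical large I/O

    # Four hypothetical 4 KB nodes on one root-to-leaf path, laid out
    # back to back: one linear I/O covers the whole path.
    contiguous_path = [0, 4096, 8192, 12288]

    # The same path with its nodes scattered across the disk.
    scattered_path = [0, 1 << 20, 5 << 20, 9 << 20]

    print(physical_ios(contiguous_path, io_size))
    print(physical_ios(scattered_path, io_size))
```

With the contiguous layout the whole path is covered by a single read, while the scattered layout costs one read per node.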
HAMMER will also be able to issue 100% asynchronous I/Os for all B-Tree
operations, because it doesn't need an intact B-Tree for recovery of the
filesystem. It can reconstruct the B-Tree for a cluster by scanning
the records in the cluster and using a stored transaction id versus the
transaction id in the records to determine what can be restored and
what still may have had pending asynchronous I/O and thus cannot.
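A minimal sketch of that recovery idea, under stated assumptions: the `Record` type, the `synced_tid` parameter, and the rule that a record is restorable when its transaction id does not exceed the stored one are all hypothetical, chosen only to illustrate the comparison described above. A sorted list stands in for the rebuilt B-Tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: int
    tid: int   # transaction id stamped on the record (hypothetical field)
    data: str


def rebuild_index(records, synced_tid):
    """Rebuild an ordered index from a raw scan of a cluster's records.

    Assumption: records whose tid is greater than the cluster's stored
    transaction id may still have had asynchronous I/O pending, so they
    are left out of the rebuilt index."""
    recoverable = [rec for rec in records if rec.tid <= synced_tid]
    # Sorting by key (then tid) stands in for inserting into a B-Tree.
    return sorted(recoverable, key=lambda rec: (rec.key, rec.tid))


if __name__ == "__main__":
    scanned = [Record(5, 10, "a"), Record(1, 12, "b"), Record(3, 20, "c")]
    for rec in rebuild_index(scanned, synced_tid=15):
        print(rec)
```

The point of the sketch is that the index is derived data: as long as the records and a trustworthy transaction id survive, the index can be thrown away and rebuilt.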
HAMMER will implement one B-Tree per cluster (where a cluster is e.g.
64MB), and then hook clusters together at B-Tree leaf nodes (B-Tree
leaf -> root of some other cluster). This means that HAMMER will be
able to lock modifying operations cluster-by-cluster at the very
least and hopefully greatly improve the amount of parallelism supported
by the filesystem.
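The cluster-by-cluster locking described here might look roughly like the sketch below (an illustration, not HAMMER's implementation). The `Cluster` class, its `links` list of key ranges, and the `store` function are hypothetical; a plain dictionary stands in for each cluster's B-Tree, and a link stands in for a leaf that points at the root of another cluster.

```python
import threading


class Cluster:
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()  # one lock per cluster
        self.records = {}             # stand-in for this cluster's B-Tree
        self.links = []               # (low, high, Cluster): keys in
                                      # [low, high) live in another cluster


def _next_cluster(cluster, key):
    """Return the cluster that a link hands `key` off to, or None."""
    for low, high, target in cluster.links:
        if low <= key < high:
            return target
    return None


def store(root, key, value):
    """Store a record, holding only one cluster's lock at a time.

    Returns the names of the clusters that were locked on the way."""
    cluster, locked = root, []
    while True:
        with cluster.lock:
            locked.append(cluster.name)
            target = _next_cluster(cluster, key)
            if target is None:
                cluster.records[key] = value
                return locked
        # The previous lock is released before moving to the next cluster.
        cluster = target


if __name__ == "__main__":
    root, child = Cluster("A"), Cluster("B")
    root.links.append((100, 200, child))
    print(store(root, 150, "y"))
    print(store(root, 1, "x"))
```

Because a modification only ever holds the lock of the cluster it is currently working in, writers touching different clusters do not block one another in this sketch.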
HAMMER uses an index-record-data approach. Each cluster has three types
of information in it: Indexes, records, and data. The index is a B-Tree
and B-Tree nodes will replicate most of the contents of the records as
well as supply a direct pointer to the related data. B-Tree nodes will
be localized in typed filesystem buffers (that is, grouped with other
B-Tree nodes), and B-Tree filesystem buffers will be intermixed with
data filesystem buffers to a degree, so it should have extremely
good caching characteristics. I tried to take into consideration how
hard drives cache data (which is typically whole tracks to begin with)
and incorporate that into the design.
Finally, HAMMER is designed to allow cluster-by-cluster reoptimization
of the storage layout. Anything that isn't optimally laid out at the
time it was created can be re-laid out at some later time, e.g. with
a continuously running background process or a nightly cron job or
something of that ilk. This will allow HAMMER to choose to use an
expedient layout instead of an optimal one in its critical path and then
'fix' the layout later on to make re-accesses optimal. I've left a ton
of bytes free in the filesystem buffer headers for records and clusters
for (future) usage-pattern tracking heuristics.
The radix tree bitmap allocator, which has been committed so you can
take a look at it if you want, is extremely sophisticated. It should
be able to optimally allocate and free various types of information
all the way from megabyte+ sized chunks down to a 64-byte boundary,
in powers of 2.
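As a small illustration of the size classes mentioned here (not the committed allocator, which is a radix tree of bitmaps), the sketch below rounds a request up to the next power of two with a 64-byte floor. The function name `chunk_size` and the round-up policy are assumptions made for the example.

```python
MIN_CHUNK = 64  # smallest allocation boundary mentioned in the message


def chunk_size(nbytes: int) -> int:
    """Round a request up to the next power-of-2 size, never below
    MIN_CHUNK bytes (an assumed policy, for illustration only)."""
    if nbytes < 1:
        raise ValueError("request must be at least one byte")
    size = MIN_CHUNK
    while size < nbytes:
        size *= 2
    return size


if __name__ == "__main__":
    for request in (1, 65, 100_000, 1 << 20):
        print(request, chunk_size(request))
```

Under this policy a 65-byte request occupies a 128-byte chunk and a 100,000-byte request occupies 131,072 bytes, while a request of exactly one megabyte is left unchanged.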
My expectation is that this will lead to a fairly fast filesystem. We
will know in about a month :-)
-Matt
Matthew Dillon
<dillon@backplane.com>
I've never looked at the
I've never looked at the Reiser code though the comments I get from friends who use it are on the order of 'extremely reliable but not the fastest filesystem in the world'.
Either he typed that out wrong, or I suspect his friends are having a laugh.
I'm pretty sure that if you asked a hundred Linux users about ReiserFS, at least 99 of them would describe it as "extremely fast but not the most reliable filesystem in the World".
Now, I never had any reliability problems with ReiserFS when I was using it, and I suspect that most of the people who have done are either exaggerating or shot themselves in the foot due to inexperience. But whatever the truth of the matter, there is no way on Earth you can say that ReiserFS has a reputation of being very reliable but slow. Exactly the opposite in fact.
I've been using ReiserFS
I've been using ReiserFS (version 3.6) on my current laptop, for the last few years. I've never had an issue with it.
I had an older laptop with Reiser 3.5, which managed to put a file in such a state that I could not delete it. I had to use a CVS version of reiserfsck to fix the issue. That was about 5 years ago. From then on, it's been working flawlessly.
I'm quite happy with the performance and reliability of ReiserFS. It'll be interesting to see how version 4 shapes up.
A laptop isn't generally the
A laptop isn't generally the best test environment; a server that supports millions of users gives a much better view of stability.
What makes ReiserFS
What makes ReiserFS seemingly unreliable is that it can do bad things in BAD situations. However, if you have a normal stable environment, it's fine. The unusual organization of the data on a drive makes it highly susceptible to corruption when things like bad sectors occur. However, if your drives are in good shape, no problem.
ReiserFS suffered from Reiser attitude
I used ReiserFS on a home server a few months after its introduction in the kernel and distros. I got several corruptions (about 4 or 5) in less than three months.
I was hit by two problems: the first was a race condition in ReiserFS open() that was identified only one year later, the second was a few bad sectors on the disk.
What was disgusting was Reiser's behavior toward issue reports. He always denied there were problems in ReiserFS. For him, it was always bad hardware.
He was partially right in my case. But after seeing corruption again on new, healthy hardware, I lost trust in ReiserFS (and in Reiser too). I got tired of spending hours on fsck too.
I doubt I will try ReiserFS again on my hardware.
hello! it depends on one's
hello!
It depends on one's experience; I've been using reiserfs on a lot of systems (some with bad disks) and didn't see a single failure or corruption. At the same time I suffered some issues with ext3 (that was 4 years ago).
After 2 years of distrusting it, I tried it again and I'm now a happy ext3 user (and reiserfs too).