login
Header Space

 
 

HAMMER's B+Tree Implementation

June 17, 2008 - 10:52pm
Submitted by Jeremy on June 17, 2008 - 10:52pm.
DragonFlyBSD

"HAMMER makes no modifications to the B-Tree whatsoever on the front-end. When you create, delete, rename, write, etc... when you do those operations HAMMER caches them in a virtualization layer in memory and doesn't make any modifications to its on-media data structures (or their in-memory representations) at all until the meta-data is synced to disk," DragonFly BSD creator Matthew Dillon explained, comparing HAMMER, his clustering filesystem, to a wiki summary of Reiser4's implementations. He continued:

"HAMMER uses a modified B+Tree for its on-disk representation, which is a B-Tree with only keys at internal nodes and only records at the leafs. This was done to reduce structural bloat, allow for a leaf->leaf linking optimization in the future, and for other reasons. [...] HAMMER's internal nodes have a left and right bounding element. A standard B+Tree only has a left bounding element. By adding a right bounding element HAMMER can cache pointers into its B+Tree and 'pick up' searches, insertions, and deletions relative to the cached pointers instead of having to start at the root of the tree. More importantly, it can pickup searches, insertions, and deletions at internal nodes, not just leaf nodes. So I can cache a proximity pointer and if I do a good job I never have to traverse the B+Tree above that point."


From: Matthew Dillon <dillon@...>
Subject: HAMMER Update 13-June-2008 - HEADS UP: MEDIA CHANGED AGAIN
Date: Jun 13, 10:01 pm 2008

Commit 55 to the filesystem and utilities changes the media format
    again.  A new blockmap has been added as a stop-gap while I 
    research storage model issues more.  Please note:

	* You must recompile the filesystem AND its utilities (hammer and
	  newfs_hammer).

	* You must re-newfs_hammer your test filesystems.

    This commit also fixes a panic and *might* also have fixed the reported
    machine lockup.  It does not fix the memory exhaustion issue yet.

    There is a known issue (that's been around for a while) where the
    HAMMER flusher threads can become cpu bound, creating large noticeable
    pauses in machine operation.  That is related to B-Tree overheads and
    has not yet been dealt with.

    I have made progress on the blogbench performance drop issue when run
    on the same directory over and over again.  Blogbench does its best
    to really mess up B-Trees :-).  I have narrowed the issue down to
    directory/namespace lookups in the face of massive directory rewriting.
    blogbench does that in spades by creating .tmp files and renaming them
    over the originals.

    Unfortunately, the only way to really fix it is to introduce true
    locality of reference on disk on a file-by-file and directory-by-
    directory basis, kinda like how UFS does with cylinder groups.
    In HAMMER's case this means adding a field to the blockmap's
    layer2 structure (which represents an 8MB chunk of disk space) to
    allow incremental appends anywhere on the disk.  This is probably what
    I will be working on for the rest of this month.

						-Matt
						Matthew Dillon 
						<dillon@backplane.com>

From: strangepics <strangepics@...> Subject: Re: HAMMER Update 13-June-2008 - HEADS UP: MEDIA CHANGED AGAIN Date: Jun 15, 2:22 am 2008 Matthew Dillon wrote: > on the same directory over and over again. Blogbench does its best > to really mess up B-Trees :-). I have narrowed the issue down to That's probably why they don't use B-trees for ReiserFS. What about crit-bit? Other trees?
From: Matthew Dillon <dillon@...> Subject: Re: HAMMER Update 13-June-2008 - HEADS UP: MEDIA CHANGED AGAIN Date: Jun 15, 10:28 pm 2008 : :Matthew Dillon wrote: :> on the same directory over and over again. Blogbench does its best :> to really mess up B-Trees :-). I have narrowed the issue down to : :That's probably why they don't use B-trees for ReiserFS. : :What about crit-bit? Other trees? I am not too familiar with Reiser so I can't really come to that conclusion, but it doesn't seem likely that the reasons are similar. HAMMER's issue insofar as the B-Tree goes is mainly due to its history retention practices. If I mount with -o nohistory then the issue becomes one of locality of reference due to HAMMER not immediately reusing space freed by the rename-over that blogbench does. After running the test over the weekend the culprit seems to pointing more towards HAMMER's low level storage allocation model and away from the B-Tree per-say. I was already planning on making some major changes there so we'll see what pans out. -Matt Matthew Dillon <dillon@backplane.com>
From: Dan M <strangepics@...> Subject: Re: HAMMER Update 13-June-2008 - HEADS UP: MEDIA CHANGED AGAIN Date: Jun 16, 4:18 pm 2008 On Sun, Jun 15, 2008 at 10:28 PM, Matthew Dillon <dillon@apollo.backplane.com> wrote: > I am not too familiar with Reiser so I can't really come to that > conclusion, but it doesn't seem likely that the reasons are similar. They used B+ in earlier versions. In Reiser4 they decided to use http://en.wikipedia.org/wiki/B*-tree and http://en.wikipedia.org/wiki/Dancing_trees -- Dan
From: Matthew Dillon <dillon@...> Subject: Re: HAMMER Update 13-June-2008 - HEADS UP: MEDIA CHANGED AGAIN Date: Jun 16, 5:37 pm 2008 :<dillon@apollo.backplane.com> wrote: :> I am not too familiar with Reiser so I can't really come to that :> conclusion, but it doesn't seem likely that the reasons are similar. : :They used B+ in earlier versions. In Reiser4 they decided to use : :http://en.wikipedia.org/wiki/B*-tree : :and : :http://en.wikipedia.org/wiki/Dancing_trees : :-- :Dan I can never keep all the variations straight, I need a Wiki reference every time to get the name right :-). These are all minor variations of B+Trees. The B* variation keeps nodes 2/3rds full instead of half full. It is an optimization which is intended to maintain higher fill ratios to make better use of system caches. A Dancing tree ... hehehehe. Looks like Hans invented that one. I'm not sure it even counts as a variation of a B+Tree. It doesn't apply to HAMMER at all because HAMMER makes no modifications to the B-Tree whatsoever on the front-end. When you create, delete, rename, write, etc... when you do those operations HAMMER caches them in a virtualization layer in memory and doesn't make any modifications to its on-media data structures (or their in-memory representations) at all until the meta-data is synced to disk. HAMMER doesn't have to make real-time on-media modifications for any meta-data change for ANY operation. Not a single one. Not rename, not hard or soft links, not truncations, nothing. HAMMER uses a modified B+Tree for its on-disk representation, which is a B-Tree with only keys at internal nodes and only records at the leafs. This was done to reduce structural bloat, allow for a leaf->leaf linking optimization in the future, and for other reasons. The Wiki definition isn't quite right. The leaf->leaf links are optional (and HAMMER doesn't implement them, though I've left room to do so). It's still a B+Tree if you don't have them. HAMMER's custom modification is to formalize both a left AND a right bounding element. So a HAMMER B+Tree looks like this: B B B B B B <---- internal node (e.g. 6 elements w/ 5 leaf ptrs) L L L L L <---- leaf nodes Whereas a normal B+Tree looks like this (note the missing right bound): B B B B B <---- internal node L L L L L <---- leaf nodes Note that HAMMER uses a radix of 64, so e.g. 64 internal elements but one is a right-bound so only 63 pointers to the next level. The leafs then have 64 elements. So HAMMER's B+Tree recursion is 63x63x63....x64. HAMMER's internal nodes have a left and right bounding element. A standard B+Tree only has a left bounding element. By adding a right bounding element HAMMER can cache pointers into its B+Tree and 'pick up' searches, insertions, and deletions relative to the cached pointers instead of having to start at the root of the tree. More importantly, it can pickup searches, insertions, and deletions at internal nodes, not just leaf nodes. So I can cache a proximity pointer and if I do a good job I never have to traverse the B+Tree above that point. Without the right bound a search cannot be picked up from mid-tree without potentially going down the wrong branch and then having to backtrack. -- Many of the textbook optimizations listed on e.g. Wiki have serious issues... you can't make the blanket statement "method A is better then method B". Balancing optimizations can cause massive issues with deadlocks, for example, or a need to lock extra nodes, or a need to traverse a greater portion of the tree. In fact, for that reason, a lot of people don't even try to balance B-Tree fill ratios in the critical path, and neither does HAMMER. Putting in better balancing and fill-ratio algorithms are something I'll probably do in the pruner or reblocker at some point, but never in the critical path. I've left the implementation open to additional optimizations, such as the leaf-linking optimization and better balancing optimizations such as you get with B*. There are plenty of others as well, and in fact WHEN you do the optimizations is often more important then WHAT optimizations you choose to do. HAMMER does a bunch of other stuff as well. But HAMMER's main optimization is the addition of the right bounding element and its ability to cache pointers into the middle of the B+Tree to short-cut searches. -Matt Matthew Dillon <dillon@backplane.com>

From: Matthew Dillon <dillon@...>
Subject: HAMMER Update 16-June-2008 - HEADS UP: MEDIA CHANGED AGAIN (2 of 4)
Date: Jun 17, 4:53 am 2008

NOTE!  This week (June 16-20 2008) there are going to be several commits
    each of which change the on-media structure and will require recompilation
    of the HAMMER utilities, kernel, and for test filesystems to be
    newfs_hammer'd.

    This is the second of possibly three or four media changes.  I hope
    to get them all done this week, and then that will be it for media
    changes before the 2.0 release.

    Please note that this week testers should expect some instability due
    to the complexity of the media changes, but I don't expect it to last
    beyond the end of the week.

				    KNOWN BUGS

    * There is currently one known failure case that the FSX test finds, which
      I hope to address this week.  It is related to the BMAP code.

    * In-memory record allocations can still blow out system memory and 
      cause a kmalloc panic.

    * bawrite() sometimes panics due to a race (should get this fixed
      tuesday).

			       REMAINDER OF THIS WEEK

    For the rest of this week I will be working on bugs and the remaining
    media changes that I wanted to get in.  In particular going to a larger
    block size (and probably implementing clustered writes), and tuning
    the low level storage manager.

    I believe that going to the larger blocksize will significantly improve
    performance as 64K and 128K single-record writes will cut the B-Tree
    overhead down by 200-800% verses the 16K single-record writes HAMMER
    does now.  This should also, coupled with some work on the low level
    allocator, greatly improve random read performance in the blogbench test.

					    -Matt
					    Matthew Dillon 
					    <dillon@backplane.com>

From: Matthew Dillon <dillon@...>
Subject: HAMMER Update 17-June-2008 - HEADS UP: MEDIA CHANGED AGAIN (2.5 of 4)
Date: Jun 17, 9:54 pm 2008

More media changes have gone in.  Not the one I intended to put in
    today so there are still two left.

    Testers need to rebuild kernel / utilities and a newfs_hammer is
    needed (and will be throughout this week).  Please expect that the
    commits going in this week will require newfs'ing for testing purposes.

    56B Fixes numerous performance issues, including a very nasty 25uS crc
    calculation that was being made on every B-Tree node access and every
    B-Tree node modification.  That calculation is now only being made
    on the initial buffer load into the cache and when the modified nodes
    are written out.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>


speck-geostationary