"The big advantage Hammer has over Tux3 is, it is up and running and released in the Dragonfly distro," began Daniel Phillips, offering a comparison between the two filesystem. He continued, "the biggest disadvantage is, it runs on BSD, not Linux, and it so heavily implements functionality that is provided by the VFS and block layer in Linux that a port would be far from trivial. It will likely happen eventually, but probably in about the same timeframe that we can get Tux3 up and stable." This led into a lengthy and interesting technical discussion between Daniel and HAMMER author Matthew Dillon, comparing the design of the two filesystems.
Matthew reviewed the Tux3 notes and replied, "it sounds like Tux3 is using many similar ideas [as HAMMER]. I think you are on the right track. I will add one big note of caution, drawing from my experience implementing HAMMER, because I think you are going to hit a lot of the same issues. I spent 9 months designing HAMMER and 9 months implementing it. During the course of implementing it I wound up throwing away probably 80% of the original design outright." Daniel noted that he's been working on the Tux3 design for around ten years, "and working seriously on the simplifying elements for the last three years or so, either entirely on paper or in related work like ddsnap and LVM3." Matthew cautioned, "I can tell you've been thinking about Tux for a long time. If I had one worry about your proposed implementation it would be in the area of algorithmic complexity. You have to deal with the in-memory cache, the log, the B-Tree, plus secondary indexing for snapshotted elements and a ton of special cases all over the place. Your general lookup code is going to be very, very complex. My original design for HAMMER was a lot more complex (if you can believe it!) then the end result. A good chunk of what I had to do going from concept to reality was deflate a lot of that complexity." The friendly conversation offers a very detailed look at the design choices made in each of these file systems.
From: Pedro F. Giffuni Subject: [Tux3] Comparison to Hammer fs design Date: Jul 25, 9:22 am 2008 Hi; The announcement of yet another filesystem: http://lkml.org/lkml/2008/7/23/257 led to some comments about hammer fs: http://tux3.org/pipermail/tux3/2008-July/000006.html enjoy, Pedro.
From: Daniel Phillips Subject: [Tux3] Comparison to Hammer fs design Date: Thu Jul 24 13:26:27 PDT 2008 I read Matt Dillon's Hammer filesystem design with interest: http://apollo.backplane.com/DFlyMisc/hammer01.pdf Link kindly provided by pgquiles. The big advantage Hammer has over Tux3 is, it is up and running and released in the Dragonfly distro. The biggest disadvantage is, it runs on BSD, not Linux, and it so heavily implements functionality that is provided by the VFS and block layer in Linux that a port would be far from trivial. It will likely happen eventually, but probably in about the same timeframe that we can get Tux3 up and stable. Tux3 is a simpler design than Hammer as far as I can see, and stays closer to our traditional ideas of how a filesystem behaves, for example there is no requirement for a background process to be continuously running through the filesystem reblocking it to recover space. Though Tux3 does have the notion of followup metadata passes to "promote" logical forward log changes to physical changes to btree nodes etc. However this does not have to be a daemon, it can just be something that happens every so many write transactions in the process context that did the write. Avoiding daemons in filesystems is good - each one needs special attention to avoid deadlock, and they mess up the the ps list, a minor but esthetic consideration. Matt hit on a similar idea to versioned pointers, that is, his birth and death version numbers for disk records. So we both saw independently that recursive copy on write as in WAFL, ZFS and Btrfs is suboptimal. I found that only the birth version is actually required, simply because file data elements never actually die, they is only ever overwritten or truncated away. Therefore, a subsequent birth always implies the death of the previous data element, and only the birth version has to be stored, which I simply call the "version". Data element death by truncate is handled by the birth of a new (versioned) size attribute. Eventually Matt should realize that too, and rev Hammer to improve its storage efficiency. Oddly, Hammer only seems to support a linear chain of versions, whereas I have shown that with no increase in the size of metadata (except for the once-per-volume version tree) you can store writable versions with arbitrary parentage. I think Matt should take note of that too and incorporate it in Hammer. Some aspects of the Hammer seem quite inefficient, so I wonder what he means when he says it performs really well. In comparison to what? Well I don't have a lot more to say about that until Tux3 gets to the benchmark stage, and they we will be benchmarking mainly against Ext3, XFS and Btrfs. Matt seems somewhat cavalier about running out of space on small volumes, whereas I think a filesystem should scale all the way from a handful of meg to at least terabytes and preferably petabytes. The heavy use of a vacuum like reblocking process seems undesirable to me. I like my disk lights to go out as soon as the data is safely on the platter, not continue flashing for minutes or hours after every period of activity. Admittedly, I did contemplate something similar for ddsnap, to improve write efficiency. I now think that fragmentation can be held down to a dull roar without relying on a defragger, and that defragging should only be triggered at planned times by an administrator. We will see what happens in practice. Tux3 has a much better btree fanout than Hammer, 256 vs 64 for Hammer using the same size 4K btree index blocks. Fanout is an important determinant of the K in O(log(N)) btree performance, which turns out to be very important when comparing different filesystems, all of which theoretically are Log(N), but some of which have an inconveniently large K (ZFS comes to mind). I always try to make the fanout as high as possible in my btree, which for example is a major reason that the HTree index for Ext3 performs so well. Actually, I think I can boost the Tux3 inode table btree fanout up to 512 by having a slightly different format for the next-to-terminal inode table index blocks with 16 bits inum, 48 bits leaf block, because at the near-terminal index nodes the inum space is already divided down to a small range. More comments about Hammer later as I learn more about it. Regards, Daniel
From: Matthew Dillon Subject: Re: [Tux3] Comparison to Hammer fs design Date: Jul 25, 11:53 am 2008 :Hi; : :The announcement of yet another filesystem: : :http://lkml.org/lkml/2008/7/23/257 : :led to some comments about hammer fs: : :http://tux3.org/pipermail/tux3/2008-July/000006.html : :enjoy, : : Pedro. Those are interesting comments. I think I found Daniel's email address so I am adding him to the To: Dan, feel free to post this on your Tux groups if you want. I did consider multiple-parentage... that is the ability to have a writable snapshot that 'forks' the filesystem history. It would be an ultra cool feature to have but I couldn't fit it into the B-Tree model I was using. Explicit snapshotting would be needed to make it work, and the snapshot id would have to be made part of the B-Tree key, which is fine. HAMMER is based around implicit snapshots (being able to do an as-of historical lookup without having explicitly snapshotted bits of the history). I would caution against using B-Tree iterations in historical lookups, B-Trees are only fast if you can pretty much zero in on the exact element you want as part of the primary search mechanic. Once you have to iterate or chain performance goes out the window. I know this because there are still two places where HAMMER sometimes has to iterate due to the inexact nature of an as-of lookup. Multiple-parentage almost certainly require two inequality searches, or one inequality search and an iteration. A single timeline only requires one inequality search. I couldn't get away from having a delete_tid (the 'death version numbers'). I really tried :-) There are many cases where one is only deleting, rather then overwriting. Both numbers are then needed to be able to properly present a historical view of the filesystem. For example, when one is deleting a directory entry a snapshot after that deletion must show the directory entry gone. Both numbers are also needed to be able to properly prune out unwanted historical data from the filesystem. HAMMER's pruning algorithm (cleaning out old historical data which is no longer desired) creates holes in the sequence so once you start pruning out unwanted historical data the delete_tid of a prior record will not match the create_tid of the following one (historically speaking). At one point I did collapse the holes, rematching the delete_tid with the create_tid of the following historical record, but I had to remove that code because it seriously complicated the mirroring implementation. I wanted each mirror to be able to have its own historical retention policy independant of the master. e.g. so you could mirror to a backup system which would retain a longer and more coarse-grained history then the production system. -- I also considered increasing the B-Tree fan-out to 256 but decided against it because insertions and deletions really bloated up the UNDO FIFO. Frankly I'm still undecided as to whether that was a good idea, I would have prefered 256. I can actually compile in 256 by changing a #define, but I released with 64 because I hit up against a number of performance issues: bcopy() overheads for insertions and deletions in certain tests became very apparent. Locking conflicts became a much more serious issue because I am using whole-node locks rather then element locks. And, finally, the UNDO records got really bloated. I would need to adjust the node locking and UNDO generation code significantly to remove the bottlenecks before I could go to a 256-element B-Tree node. HAMMER's B-Tree elements are probably huge compared to Tux3, and that's another limiting factor for the fan-out I can use. My elements are 64 bytes each. 64x64 = 4K per B-Tree node. I decided to go with fully expanded keys: 64 bit object id, 64 bit file-offset/db-key, 64 bit create_tid, 64 bit delete_tid), plus a 64-bit storage offset and other auxillary info. That's instead of using a radix-compressed key. Radix compressed keys would have doubled the complexity of the B-Tree code, particularly with the middle-of-tree pointer caching that HAMMER does. The historical access mechanic alone added major complexity to the B-Tree algorithms. I will note here that B-Tree searches have a very high cpu overhead no matter how you twist it, and being able to cache cursors within the B-Tree is absolutely required if you want the filesystem to perform well. If you always start searches at the root your cpu overhead will be horrendous... so plan on caching cursors from the get-go. If I were to do radix compression I would also want to go with a fully dynamic element size and fully dynamic fan-out in order to best-compress each B-Tree node. Definitely food for thought. I'd love to do something like that. I think radix compression would remove much of the topological bloat the B-Tree creates verses using blockmaps, generally speaking. Space management is currently HAMMER's greatest weakness, but it only applies to small storage systems. Several things in HAMMER's design are simply not condusive to a small storage. The storage model is not fine-grained and (unless you are deleting a lot of stuff) needs reblocking to actually recover the freed space. The flushing algorithms need around 100MB of UNDO FIFO space on-media to handle worst-case dependancies (mainly directory/file visibility issues on crash recovery), and the front-end caching of modifying operations, since they don't know exactly how much actual storage will be needed, need ~16MB of wiggle room to be able to estimate the on-media storage required to back the operations cached by the front-end. Plus, on top of that, any sort of reblocking also needs some wiggle room to be able to clean out partially empty big-blocks and fill in new ones. On the flip side, the reblocker doesn't just de-fragment the filesystem, it is also intended to be used for expanding and contracting partitions, and adding removing volumes in the next release. Definitely a multi-purpose utility. So I'm not actually being cavalier about it, its just that I had to make some major decisions on the algorithm design and I decided to weight the design more towards performance and large-storage, and small-storage suffered a bit. -- In anycase, it sounds like Tux3 is using many similar ideas. I think you are on the right track. I will add one big note of caution, drawing from my experience implementing HAMMER, because I think you are going to hit a lot of the same issues. I spent 9 months designing HAMMER and 9 months implementing it. During the course of implementing it I wound up throwing away probably 80% of the original design outright. Amoung the major components I had to rewrite were the LOG mechanic (which ultimately became the meta-data UNDO FIFO), and the fine-grained storage mechanic (which ultimately became coarse-grained). The UNDO FIFO was actually the saving grace, once that was implemented most of the writer-ordering dependancies went away (devolved into just flushing meta-data buffers after syncing the UNDO FIFO)... I suddenly had much, much more freedom in designing the other on-disk structures and algorithms. What I found while implementing HAMMER was that the on-disk topological design essentially dictated the extent of HAMMER's feature set AND most of its deficiencies (such as having to reblock to recover space), and the algorithms I chose often dictated other restrictions. But the major restrictions came from the on-disk structures. Because of the necessarily tight integration between subsystems I found myself having to do major redesigns during the implementation phase. Fixing one subsystem created a cascade effect that required tweaking other subsystems. Even adding new features, such as the mirroring, required significant changes in the B-Tree deadlock recovery code. I couldn't get away from it. Ultimately I chose to simplify some of the algorithms rather then have to go through another 4 months of rewriting. All major designs are an exercise in making trade-offs in topology, feature set, algorithmic complexity, debuggability, robustness, etc. The list goes on forever. Laters! -Matt Matthew Dillon <dillon@backplane.com>
From: Daniel Phillips Subject: Re: [Tux3] Comparison to Hammer fs design Date: Jul 25, 3:54 pm 2008 (resent after subscribing to the the kernel@crater) On Friday 25 July 2008 11:53, Matthew Dillon wrote: > > :Hi; > : > :The announcement of yet another filesystem: > : > :http://lkml.org/lkml/2008/7/23/257 > : > :led to some comments about hammer fs: > : > :http://tux3.org/pipermail/tux3/2008-July/000006.html > : > :enjoy, > : > : Pedro. > > Those are interesting comments. I think I found Daniel's email address > so I am adding him to the To: Dan, feel free to post this on your Tux > groups if you want. How about a cross post? > I did consider multiple-parentage... that is the ability to have a > writable snapshot that 'forks' the filesystem history. It would be > an ultra cool feature to have but I couldn't fit it into the B-Tree > model I was using. Explicit snapshotting would be needed to make it > work, and the snapshot id would have to be made part of the B-Tree key, > which is fine. HAMMER is based around implicit snapshots (being able > to do an as-of historical lookup without having explicitly snapshotted > bits of the history). Yes, that is the main difference indeed, essentially "log everything" vs "commit" style versioning. The main similarity is the lifespan oriented version control at the btree leaves. > I would caution against using B-Tree iterations > in historical lookups, B-Trees are only fast if you can pretty much > zero in on the exact element you want as part of the primary search > mechanic. Once you have to iterate or chain performance goes out the > window. I know this because there are still two places where HAMMER > sometimes has to iterate due to the inexact nature of an as-of lookup. > Multiple-parentage almost certainly require two inequality searches, > or one inequality search and an iteration. A single timeline only > requires one inequality search. Once I get down to the leaf level I binary search on logical address in the case of a file index btree or on version in the case of an inode table block, so this cost is still Log(N) with a small k. For a heavily versioned inode or file region this might sometimes result in an overflow block or two that has to be linearly searched which is not a big problem so long as it is a rare case, which it really ought to be for the kinds of filesystem loads I have seen. A common example of a worst case is /var/log/messages, where the mtime and size are going to change in pretty much every version, so if you have hourly snapshots and hold them for three months it adds up to about 2200 16 byte inode table attributes to record, about 8 inode table leaf blocks. I really do not see that as a problem. If it goes up to 100 leaf blocks to search then that could be a problem. Usually, I will only be interested in the first and last of those blocks, the first holding the rarely changing attributes including the root of the file index btree and the last holding the most recent incarnation of the mtime/size attribute. The penultimate inode table index block tells me how many blocks a given inode lives in because several blocks will have the same inum key. So the lookup algorithm for a massively versioned file becomes: 1) read the first inode table block holding that inum; 2) read the last block with the same inum. The latter operation only needs to consult the immediate parent index block, which is locked in the page cache at that point. It will take a little care and attention to the inode format, especially the inode header, to make sure that I can reliably do that first/last probe optimization for the "head" version, but it does seem worth the effort. For starters I will just do a mindless linear walk of an "overflow" inode and get fancy if that turns out to be a problem. > I couldn't get away from having a delete_tid (the 'death version > numbers'). I really tried :-) There are many cases where one is only > deleting, rather then overwriting. By far the most common case I would think. But check out the versioned pointer algorithms. Surprisingly that just works, which is not exactly obvious: Versioned pointers: a new method of representing snapshots http://lwn.net/Articles/288896/ I was originally planning to keep all versions of a truncate/rewrite file in the same file index, but recently I realized that that is dumb because there will never be any file data shared by a successor version in that case. So the thing to do is just create an entirely new versioned file data attribute for each rewrite, bulking up the inode table entry a little but greatly constraining the search for versions to delete and reducing cache pressure by not loading unrelated version data when traversing a file. > Both numbers are then needed to > be able to properly present a historical view of the filesystem. > For example, when one is deleting a directory entry a snapshot after > that deletion must show the directory entry gone. ...which happens in Tux3 courtesy of the fact that the entire block containing the dirent will have been versioned, with the new version showing the entry gone. Here is one of two places where I violate my vow to avoid copying an entire block when only one data item in it changes (the other being the atime table). I rely on two things to make this nice: 1) Most dirent changes will be logically logged and only rolled up into the versioned file blocks when there are enough to be reasonably sure that each changed directory block will be hit numerous times in each rollup episode. (Storing the directory blocks in dirent-create order as in PHTree makes this very likely for mass deletes.) 2) When we care about this is usually during a mass delete, where most or all dirents in each directory file block are removed before moving on to the next block. > Both numbers are > also needed to be able to properly prune out unwanted historical data > from the filesystem. HAMMER's pruning algorithm (cleaning out old > historical data which is no longer desired) creates holes in the > sequence so once you start pruning out unwanted historical data > the delete_tid of a prior record will not match the create_tid of the > following one (historically speaking). Again, check out the versioned pointer algorithms. You can tell what can be pruned just by consulting the version tree and the create_tids (birth versions) for a particular logical address. Maybe the hang is that you do not organize the btrees by logical address (or inum in the case of the inode table tree). I thought you did but have not read closely enough to be sure. > At one point I did collapse the holes, rematching the delete_tid with > the create_tid of the following historical record, but I had to remove > that code because it seriously complicated the mirroring implementation. > I wanted each mirror to be able to have its own historical retention > policy independant of the master. e.g. so you could mirror to a backup > system which would retain a longer and more coarse-grained history then > the production system. Fair enough. I have an entirely different approach to what you call mirroring and what I call delta replication. (I reserve the term mirroring to mean mindless duplication of physical media writes.) This method proved out well in ddsnap: http://phunq.net/ddtree?p=zumastor/.git;a=tree;h=fc5cb496fff10a2b03034fcf95122f5828149... (Sorry about the massive URL, you can blame Linus for that;-) What I do in ddsnap is compute all the blocks that differ between two versions and apply those to a remote volume already holding the first of the two versions, yielding a replica of the second version that is logically but not physically identical. The same idea works for a versioned filesystem: compute all the leaf data that differs between two versions, per inum, and apply the resulting delta to the corresponding inums in the remote replica. The main difference vs a ddsnap volume delta is that not all of the changes are physical blocks, there are also changed inode attributes, so the delta stream format has to be elaborated accordingly. > I also considered increasing the B-Tree fan-out to 256 but decided > against it because insertions and deletions really bloated up the > UNDO FIFO. Frankly I'm still undecided as to whether that was a good > idea, I would have prefered 256. I can actually compile in 256 by > changing a #define, but I released with 64 because I hit up against > a number of performance issues: bcopy() overheads for insertions > and deletions in certain tests became very apparent. Locking > conflicts became a much more serious issue because I am using whole-node > locks rather then element locks. And, finally, the UNDO records got > really bloated. I would need to adjust the node locking and UNDO > generation code significantly to remove the bottlenecks before I could > go to a 256-element B-Tree node. I intend to log insertions and deletions logically, which keeps each down to a few bytes until a btree rollup episode comes along to perform updating of the btree nodes in bulk. I am pretty sure this will work for you as well, and you might want to check out that forward logging trick. That reminds me, I was concerned about the idea of UNDO records vs REDO. I hope I have this right: you delay acknowledging any write transaction to the submitter until log commit has progressed beyond the associated UNDO records. Otherwise, if you acknowledge, crash, and prune all UNDO changes, then you would end up with applications believing that they had got things onto stable storage and be wrong about that. I have no doubt you did the right thing there, but it is not obvious from your design documentation. > HAMMER's B-Tree elements are probably huge compared to Tux3, and that's > another limiting factor for the fan-out I can use. My elements > are 64 bytes each. Yes, I mostly have 16 byte elements and am working on getting most of them down to 12 or 8. > 64x64 = 4K per B-Tree node. I decided to go > with fully expanded keys: 64 bit object id, 64 bit file-offset/db-key, > 64 bit create_tid, 64 bit delete_tid), plus a 64-bit storage offset and > other auxillary info. That's instead of using a radix-compressed key. > Radix compressed keys would have doubled the complexity of the B-Tree > code, particularly with the middle-of-tree pointer caching that > HAMMER does. I use two-stage lookup, or four stage if you count searches within btree blocks. This makes the search depth smaller in the case of small files, and in the case of really huge files it adds depth exactly as appropriate. The index blocks end up cached in the page cache (the buffer cache is just a flavor of page cache in recent Linux) so there is little repeated descending through the btree indices. Instead, most of the probing is through the scarily fast radix tree code, which has been lovingly optimized over the years to avoid cache line misses and SMP lock contention. I also received a proposal for a "fat" btree index from a collaborator (Maciej) that included the file offset but I really did not like the... fattening. A two level btree yields less metadata overall which in my estimation is more important than saving some bree probes. The way things work these days, falling down from level 1 cache to level 2 or from level 2 to level 3 costs much more than executing some extra CPU instructions. So optimization strategy inexorably shifts away from minimizing instructions towards minimizing cache misses. > The historical access mechanic alone added major complexity to the > B-Tree algorithms. I will note here that B-Tree searches have a very > high cpu overhead no matter how you twist it, and being able to cache > cursors within the B-Tree is absolutely required if you want the > filesystem to perform well. If you always start searches at the root > your cpu overhead will be horrendous... so plan on caching cursors > from the get-go. Fortunately, we get that for free on Linux, courtesy of the page cache radix trees :-) I might eventually add some explicit cursor caching, but various artists over the years have noticed that it does not make as much difference as you might think. > If I were to do radix compression I would also want to go with a > fully dynamic element size and fully dynamic fan-out in order to > best-compress each B-Tree node. Definitely food for thought. Indeed. But Linux is braindamaged about large block size, so there is very strong motivation to stay within physical page size for the immediate future. Perhaps if I get around to a certain hack that has been perenially delayed, that situation will improve: "Variable sized page objects" http://lwn.net/Articles/37795/ > I'd love to do something like that. I think radix compression would > remove much of the topological bloat the B-Tree creates verses using > blockmaps, generally speaking. Topological bloat? > Space management is currently HAMMER's greatest weakness, but it only > applies to small storage systems. Several things in HAMMER's design > are simply not condusive to a small storage. The storage model is not > fine-grained and (unless you are deleting a lot of stuff) needs > reblocking to actually recover the freed space. The flushing algorithms > need around 100MB of UNDO FIFO space on-media to handle worst-case > dependancies (mainly directory/file visibility issues on crash recovery), > and the front-end caching of modifying operations, since they don't > know exactly how much actual storage will be needed, need ~16MB of > wiggle room to be able to estimate the on-media storage required to > back the operations cached by the front-end. Plus, on top of that, > any sort of reblocking also needs some wiggle room to be able to clean > out partially empty big-blocks and fill in new ones. Ah, that is a very nice benefit of Tux3-style forward logging I forgot to mention in the original post: transaction size is limited only by available free space on the volume, because log commit blocks can be anywhere. Last night I dreamed up a log commit block variant where the associated transaction data blocks can be physically discontiguous, to make this assertion come true, shortly after reading Stephen Tweedie's horror stories about what you have to do if you must work around not being able to put an entire, consistent transaction onto stable media in one go: http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html Most of the nasties he mentions just vanish if: a) File data is always committed atomically b) All commit transactions are full transactions He did remind me (via time travel from year 2000) of some details I should write into the design explicitly, for example, logging orphan inodes that are unlinked while open, so they can be deleted on replay after a crash. Another nice application of forward logging, which avoids the seek-happy linked list through the inode table that Ext3 does. > On the flip side, the reblocker doesn't just de-fragment the filesystem, > it is also intended to be used for expanding and contracting partitions, > and adding removing volumes in the next release. Definitely a > multi-purpose utility. Good point. I expect Tux3 will eventually have a reblocker (aka defragger). There are some really nice things you can do, like: 1) Set a new version so logical changes cease for the parent version. 2) We want to bud off a given directory tree into its own volume, so start by deleting the subtree in the current version. If any link counts in the directory tree remain nonzero in the current version then there are hard links into the subtree, so fail now and drop the new version. 3) Reblock a given region of the inode table tree and all the files in it into one physically contiguous region of blocks 4) Add a free tree and other once-per volume structures to the new home region. 5) The new region is now entirely self contained and even keeps its version history. At the volume manager level, map it to a new, sparse volume that just has a superblock in the zeroth extent and the new, mini filesystem at some higher logical address. Remap the region in the original volume to empty space and add the empty space to the free tree. 6) Optionally reblock the newly budded filesystem to the base of the new volume so utilties that do not not play well with sparse volumes do not do silly things. > So I'm not actually being cavalier about it, its just that I had to > make some major decisions on the algorithm design and I decided to > weight the design more towards performance and large-storage, and > small-storage suffered a bit. Cavalier was a poor choice of words, the post was full of typos as well so I am not proud of it. You are solving a somewhat different problem and you have code out now, which is a huge achievement. Still I think you can iteratively improve your design using some of the techniques I have stumbled upon. There are probably some nice tricks I can get from your code base too once I delve into it. > In anycase, it sounds like Tux3 is using many similar ideas. I think > you are on the right track. Thankyou very much. I think you are on the right track too, which you have a rather concrete way of proving. > I will add one big note of caution, drawing > from my experience implementing HAMMER, because I think you are going > to hit a lot of the same issues. > > I spent 9 months designing HAMMER and 9 months implementing it. During > the course of implementing it I wound up throwing away probably 80% of > the original design outright. Amoung the major components I had to > rewrite were the LOG mechanic (which ultimately became the meta-data > UNDO FIFO), and the fine-grained storage mechanic (which ultimately > became coarse-grained). The UNDO FIFO was actually the saving grace, > once that was implemented most of the writer-ordering dependancies went > away (devolved into just flushing meta-data buffers after syncing the > UNDO FIFO)... I suddenly had much, much more freedom in designing > the other on-disk structures and algorithms. > > What I found while implementing HAMMER was that the on-disk topological > design essentially dictated the extent of HAMMER's feature set AND > most of its deficiencies (such as having to reblock to recover space), > and the algorithms I chose often dictated other restrictions. But the > major restrictions came from the on-disk structures. > > Because of the necessarily tight integration between subsystems I found > myself having to do major redesigns during the implementation phase. > Fixing one subsystem created a cascade effect that required tweaking other > subsystems. Even adding new features, such as the mirroring, required > significant changes in the B-Tree deadlock recovery code. I couldn't get > away from it. Ultimately I chose to simplify some of the algorithms > rather then have to go through another 4 months of rewriting. All > major designs are an exercise in making trade-offs in topology, feature > set, algorithmic complexity, debuggability, robustness, etc. The list > goes on forever. > The big ahas! that eliminated much of the complexity in the Tux3 design were: * Forward logging - just say no to incomplete transactions * Version weaving - just say no to recursive copy on write Essentially I have been designing Tux3 for ten years now and working seriously on the simplifying elements for the last three years or so, either entirely on paper or in related work like ddsnap and LVM3. One of the nice things about moving on from design to implementation of Tux3 is that I can now background the LVM3 design process and see what Tux3 really wants from it. I am determined to match every checkbox volume management feature of ZFS as efficiently or more efficiently, without violating the traditional layering between filesystem and block device, and without making LVM3 a Tux3-private invention. > Laters! Hopefully not too much later. I find this dialog very fruitful. I just wish such dialog would occur more often at the design/development stage in Linux and other open source work instead of each group obsessively ignoring all "competing" designs and putting huge energy into chatting away about the numerous bugs that arise from rushing their design or ignoring the teachings of history. Regards, Daniel
From: Matthew Dillon Subject: Re: [Tux3] Comparison to Hammer fs design Date: Jul 25, 7:02 pm 2008 :> so I am adding him to the To: Dan, feel free to post this on your Tux :> groups if you want. : :How about a cross-post? I don't think it will work, only subscribers can post to the DFly groups, but we'll muddle through it :-) I will include the whole of the previous posting so the DFly groups see the whole thing, if you continue to get bounces. I believe I have successfully added you as an 'alias address' to the DragonFly kernel list so you shouldn't get bounced if you Cc it now. :Yes, that is the main difference indeed, essentially "log everything" vs :"commit" style versioning. The main similarity is the lifespan oriented :version control at the btree leaves. Reading this and a little more that you describe later let me make sure I understand the forward-logging methodology you are using. You would have multiple individually-tracked transactions in progress due to parallelism in operations initiated by userland and each would be considered committed when the forward-log logs the completion of that particular operation? If the forward log entries are not (all) cached in-memory that would mean that accesses to the filesystem would have to be run against the log first (scanning backwards), and then through to the B-Tree? You would solve the need for having an atomic commit ('flush groups' in HAMMER), but it sounds like the algorithmic complexity would be very high for accessing the log. And even though you wouldn't have to group transactions into larger commits the crash recovery code would still have to implement those algorithms to resolve directory and file visibility issues. The problem with namespace visibility is that it is possible to create a virtually unending chain of separate but inter-dependant transactions which either all must go, or none of them. e.g. creating a, a/b, a/b/c, a/b/x, a/b/c/d, etc etc. At some point you have to be able to commit so the whole mess does not get undone by a crash, and many completed mini-transactions (file or directory creates) actually cannot be considered complete until their governing parent directories (when creating) or children (when deleting) have been committed. The problem can become very complex. Last question here: Your are forward-logging high level operations. You are also going to have to log meta-data (actual B-Tree manipulation) commits in order to recover from a crash while making B-Tree modifications. Correct? So your crash recovery code will have to handle both meta-data undo and completed and partially completed transactions. And there will have to be a tie-in between the meta-data commits and the transactions so you know which ones have to be replayed. That sounds fairly hairy. Have you figured out how you are doing to do that? :Once I get down to the leaf level I binary search on logical address in :the case of a file index btree or on version in the case of an inode :table block, so this cost is still Log(N) with a small k. For a :heavily versioned inode or file region this might sometimes result in :an overflow block or two that has to be linearly searched which is not :a big problem so long as it is a rare case, which it really ought to be :for the kinds of filesystem loads I have seen. A common example of a :worst case is /var/log/messages, where the mtime and size are going to :change in pretty much every version, so if you have hourly snapshots :and hold them for three months it adds up to about 2200 16 byte inode :table attributes to record, about 8 inode table leaf blocks. I really :do not see that as a problem. If it goes up to 100 leaf blocks to :search then that could be a problem. I think you can get away with this as long as you don't have too many snapshots, and even if you do I noticed with HAMMER that only a small percentage of inodes have a large number of versions associated with them from normal production operation. /var/log/messages is an excellent example of that. Log files were effected the most though I also noticed that very large files also wind up with multiple versions of the inode, such as when writing out a terrabyte-sized file. Even with a direct bypass for data blocks (but not their meta-data, clearly), HAMMER could only cache so much meta-data in memory before it had to finalize the topology and flush the inode out. A terrabyte-sized file wound up with about 1000 copies of the inode prior to pruning (one had to be written out about every gigabyte or so). :The penultimate inode table index block tells me how many blocks a :given inode lives in because several blocks will have the same inum :key. So the lookup algorithm for a massively versioned file becomes: :1) read the first inode table block holding that inum; 2) read the last :block with the same inum. The latter operation only needs to consult :the immediate parent index block, which is locked in the page cache at :that point. How are you dealing with expansion of the logical inode block(s) as new versions are added? I'm assuming you are intending to pack the inodes on the media so e.g. a 128-byte inode would only take up 128 bytes of media space in the best case. Multiple inodes would be laid out next to each other logically (I assume), but since the physical blocks are larger they would also have to be laid out next to each other physically within any given backing block. Now what happens when one has to be expanded? I'm sure this ties into the forward-log but even with the best algorithms you are going to hit limited cases where you have to expand the inode. Are you just copying the inode block(s) into a new physical allocation then? :It will take a little care and attention to the inode format, :especially the inode header, to make sure that I can reliably do that :first/last probe optimization for the "head" version, but it does seem :worth the effort. For starters I will just do a mindless linear walk :of an "overflow" inode and get fancy if that turns out to be a problem. : Makes sense. I was doing mindless linear walks of the B-Tree element arrays for most of HAMMER's implementation until it was stabilized well enough that I could switch to a binary search. And I might change it again to start out with a heuristical index estimation. :> I couldn't get away from having a delete_tid (the 'death version :> numbers'). I really tried :-) There are many cases where one is only :> deleting, rather then overwriting. : :By far the most common case I would think. But check out the versioned :pointer algorithms. Surprisingly that just works, which is not exactly :obvious: : : Versioned pointers: a new method of representing snapshots : http://lwn.net/Articles/288896/ Yes, it makes sense. If the snapshot is explicitly taken then you can store direct references and chain, and you wouldn't need a delete id in that case. From that article though the chain looks fairly linear. Historical access could wind up being rather costly. :I was originally planning to keep all versions of a truncate/rewrite :file in the same file index, but recently I realized that that is dumb :because there will never be any file data shared by a successor version :in that case. So the thing to do is just create an entirely new :versioned file data attribute for each rewrite, bulking up the inode :table entry a little but greatly constraining the search for versions :to delete and reducing cache pressure by not loading unrelated version :data when traversing a file. When truncating to 0 I would agree with your assessment. For the topology you are using you would definitely want to use different file data sets. You also have to deal with truncations that are not to 0, that might be to the middle of a file. Certainly not a common case, but still a case that has to be coded for. If you treat truncation to 0 as a special case you will be adding considerable complexity to the algorithm. With HAMMER I chose to keep everything in one B-Tree, whether historical or current. That way one set of algorithms handles both cases and code complexity is greatly reduced. It isn't optimal... large amounts of built up history still slow things down (though in a bounded fashion). In that regard creating a separate topology for snapshots is a good idea. :> Both numbers are then needed to :> be able to properly present a historical view of the filesystem. :> For example, when one is deleting a directory entry a snapshot after :> that deletion must show the directory entry gone. : :...which happens in Tux3 courtesy of the fact that the entire block :containing the dirent will have been versioned, with the new version :showing the entry gone. Here is one of two places where I violate my :vow to avoid copying an entire block when only one data item in it :changes (the other being the atime table). I rely on two things to :make this nice: 1) Most dirent changes will be logically logged and :only rolled up into the versioned file blocks when there are enough to :be reasonably sure that each changed directory block will be hit :numerous times in each rollup episode. (Storing the directory blocks :in dirent-create order as in PHTree makes this very likely for mass :deletes.) 2) When we care about this is usually during a mass delete, :where most or all dirents in each directory file block are removed :before moving on to the next block. This could wind up being a sticky issue for your implementation. I like the concept of using the forward-log but if you actually have to do a version copy of the directory at all you will have to update the link (or some sort of) count for all the related inodes to keep track of inode visibility, and to determine when the inode can be freed and its storage space recovered. Directories in HAMMER are just B-Tree elements. One element per directory-entry. There are no directory blocks. You may want to consider using a similar mechanism. For one thing, it makes lookups utterly trivial... the file name is hashed and a lookup is performed based on the hash key, then B-Tree elements with the same hash key are iterated until a match is found (usually the first element is the match). Also, when adding or removing directory entries only the directory inode's mtime field needs to be updated. It's size does not. Your current directory block method could also represent a problem for your directory inodes... adding and removing directory entries causing size expansion or contraction could require rolling new versions of the directory inode to update the size field. You can't hold too much in the forward-log without some seriously indexing. Another case to consider along with terrabyte-sized files and log files. :> Both numbers are :> also needed to be able to properly prune out unwanted historical data :> from the filesystem. HAMMER's pruning algorithm (cleaning out old :> historical data which is no longer desired) creates holes in the :> sequence so once you start pruning out unwanted historical data :> the delete_tid of a prior record will not match the create_tid of the :> following one (historically speaking). : :Again, check out the versioned pointer algorithms. You can tell what :can be pruned just by consulting the version tree and the create_tids :(birth versions) for a particular logical address. Maybe the hang is :that you do not organize the btrees by logical address (or inum in the :case of the inode table tree). I thought you did but have not read :closely enough to be sure. Yes, I see. Can you explain the versioned pointer algorithm a bit more, it looks almost like a linear chain (say when someone is doing a daily snapshot). It looks great for optimal access to the HEAD but it doesn't look very optimal if you want to dive into an old snapshot. For informational purposes: HAMMER has one B-Tree, organized using a strict key comparison. The key is made up of several fields which are compared in priority order: localization - used to localize certain data types together and to separate pseudo filesystems created within the filesystem. obj_id - the object id the record is associated with. Basically the inode number the record is associated with. rec_type - the type of record, e.g. INODE, DATA, SYMLINK-DATA, DIRECTORY-ENTRY, etc. key - e.g. file offset create_tid - the creation transaction id Inodes are grouped together using the localization field so if you have a million inodes and are just stat()ing files, the stat information is localized relative to other inodes and doesn't have to skip file contents or data, resulting in highly localized accesses on the storage media. Beyond that the B-Tree is organized by inode number and file offset. In the case of a directory inode, the 'offset' is the hash key, so directory entries are organized by hash key (part of the hash key is an iterator to deal with namespace collisions). The structure seems to work well for both large and small files, for ls -lR (stat()ing everything in sight) style traversals as well as tar-like traversals where the file contents for each file is read. The create_tid is the creation transaction id. Historical accesses are always 'ASOF' a particular TID, and will access the highest create_tid that is still <= the ASOF TID. The 'current' version of the filesystem uses an asof TID of 0xFFFFFFFFFFFFFFFF and hence accesses the highest create_tid. There is also a delete_tid which is used to filter out elements that were deleted prior to the ASOF TID. There is currently an issue with HAMMER where the fact that the delete_tid is *not* part of the B-Tree compare can lead to iterations for strictly deleted elements, verses replaced elements which will have a new element with a higher create_tid that the B-Tree can seek to directly. In fact, at one point I tried indexing on delete_tid instead of create_tid, but created one hell of a mess in that I had to physically *move* B-Tree elements being deleted instead of simply marking them as deleted. :Fair enough. I have an entirely different approach to what you call :mirroring and what I call delta replication. (I reserve the term :mirroring to mean mindless duplication of physical media writes.) This :method proved out well in ddsnap: : : http://phunq.net/ddtree?p=zumastor/.git;a=tree;h=fc5cb496fff10a2b03034fcf95122f5828149... : :(Sorry about the massive URL, you can blame Linus for that;-) : :What I do in ddsnap is compute all the blocks that differ between two :versions and apply those to a remote volume already holding the first :of the two versions, yielding a replica of the second version that is :logically but not physically identical. The same idea works for a :versioned filesystem: compute all the leaf data that differs between :two versions, per inum, and apply the resulting delta to the :corresponding inums in the remote replica. The main difference vs a :ddsnap volume delta is that not all of the changes are physical blocks, :there are also changed inode attributes, so the delta stream format :has to be elaborated accordingly. Are you scanning the entire B-Tree to locate the differences? It sounds you would have to as a fall-back, but that you could use the forward-log to quickly locate differences if the first version is fairly recent. HAMMER's mirroring basically works like this: The master has synchronized up to transaction id C, the slave has synchronized up to transaction id A. The mirroring code does an optimized scan of the B-Tree to supply all B-Tree elements that have been modified between transaction A and C. Any number of mirroring targets can be synchronized to varying degrees, and can be out of date by varying amounts (even years, frankly). I decided to use the B-Tree to optimize the scan. The B-Tree is scanned and any element with either a creation or deletion transaction id >= the slave's last synchronization point is then serialized and piped to the slave. To avoid having to scan the entire B-Tree I perform an optimization whereby the highest transaction id laid down at a leaf is propagated up the B-Tree all the way to the root. This also occurs if a B-Tree node is physically deleted (due to pruning), even if no elements are actually present at the leaf within the transaction range. Thus the mirroring scan is able to skip any internal node (and its entire sub-tree) which has not been modified after the synchronization point, and is able to identify any leaf for which real, physical deletions have occured (in addition to logical deletions which simply set the delete_tid field in the B-Tree element) and pass along the key range and any remaining elements in that leaf for the target to do a comparative scan with. This allows incremental mirroring, multiple slaves, and also allows a mirror to go offline for months and then pop back online again and optimally pick up where it left off. The incremental mirroring is important, the last thing I want to do is have to scan 2 billion B-Tree elements to do an incremental mirroring batch. The advantage is all the cool features I got by doing things that way. The disadvantage is that the highest transaction id must be propagated up the tree (though it isn't quite that bad because in HAMMER an entire flush group uses the same transaction id, so we aren't constantly repropagating new transaction id's up the same B-Tree nodes when flushing a particular flush group). You may want to consider something similar. I think using the forward-log to optimize incremental mirroring operations is also fine as long as you are willing to take the 'hit' of having to scan (though not have to transfer) the entire B-Tree if the mirror is too far out of date. :I intend to log insertions and deletions logically, which keeps each :down to a few bytes until a btree rollup episode comes along to perform :updating of the btree nodes in bulk. I am pretty sure this will work :for you as well, and you might want to check out that forward logging :trick. Yes. The reason I don't is because while it is really easy to lay down a forward-log, intergrating it into lookup operations (if you don't keep all the references cached in-memory) is really complex code. I mean, you can always scan it backwards linearly and that certainly is easy to do, but the filesystem's performance would be terrible. So you have to index the log somehow to allow lookups on it in reverse to occur reasonably optimally. Have you figured out how you are going to do that? With HAMMER I have an in-memory cache and the on-disk B-Tree. Just writing the merged lookup code (merging the in-memory cache with the on-disk B-Tree for the purposes of doing a lookup) was fairly complex. I would hate to have to do a three-way merge.... in-memory cache, on-disk log, AND on-disk B-Tree. Yowzer! :That reminds me, I was concerned about the idea of UNDO records vs :REDO. I hope I have this right: you delay acknowledging any write :transaction to the submitter until log commit has progressed beyond the :associated UNDO records. Otherwise, if you acknowledge, crash, and :prune all UNDO changes, then you would end up with applications :believing that they had got things onto stable storage and be wrong :about that. I have no doubt you did the right thing there, but it is :not obvious from your design documentation. The way it works is that HAMMER's frontend is almost entirely disconnected from the backend. All meta-data operations are cached in-memory. create, delete, append, truncate, rename, write... you name it. Nothing the frontend does modifies any meta-data or mata-data buffers. The only sychronization point is fsync(), the filesystem syncer, and of course if too much in-memory cache is built up. To improve performance, raw data blocks are not included... space for raw data writes is reserved by the frontend (without modifying the storage allocation layer) and those data buffers are written to disk by the frontend directly, just without any meta-data on-disk to reference them so who cares if you crash then! It would be as if those blocks were never allocated in the first place. When the backend decides to flush the cached meta-data ops it breaks the meta-data ops into flush-groups whos dirty meta-data fits in the system's buffer cache. The meta-data ops are executed, building the UNDO, updating the B-Tree, allocating or finalizing the storage, and modifying the meta-data buffers in the buffer cache. BUT the dirty meta-data buffers are locked into memory and NOT yet flushed to the media. The UNDO *is* flushed to the media, so the flush groups can build a lot of UNDO up and flush it as they go if necessary. When the flush group has completed any remaining UNDO is flushed to the media, we wait for I/O to complete, the volume header's UNDO FIFO index is updated and written out, we wait for THAT I/O to complete, *then* the dirty meta-data buffers are flushed to the media. The flushes at the end are done asynchronously (unless fsync()ing) and can overlap with the flushes done at the beginning of the next flush group. So there are exactly two physical synchronization points for each flush. If a crash occurs at any point, upon remounting after a reboot HAMMER only needs to run the UNDOs to undo any partially committed meta-data. That's it. It is fairly straight forward. For your forward-log approach you would have to log the operations as they occur, which is fine since that can be cached in-memory. However, you will still need to synchronize the log entries with a volume header update to update the log's index, so the recovery code knows how far the log extends. Plus you would also have to log UNDO records when making actual changes to the permanent media structures (your B-Trees), which is independant of the forward-log entries you made representing high level filesystem operations, and would also have to lock the related meta-data in memory until the related log entries can be synchronized. Then you would be able to flush the meta-data buffers. The forward-log approach is definitely more fine-grained, particularly for fsync() operations... those would go much faster using a forward log then the mechanism I use because only the forward-log would have to be synchronized (not the meta-data changes) to 'commit' the work. I like that, it would be a definite advantage for database operations. :> HAMMER's B-Tree elements are probably huge compared to Tux3, and that's :> another limiting factor for the fan-out I can use. My elements :> are 64 bytes each. : :Yes, I mostly have 16 byte elements and am working on getting most of :them down to 12 or 8. I don't know how you can make them that small. I spent months just getting my elements down to 64 bytes. The data reference alone for data blocks is 12 bytes (64 bit media storage offset and 32 bit length). :> 64x64 = 4K per B-Tree node. I decided to go :> with fully expanded keys: 64 bit object id, 64 bit file-offset/db-key, :> 64 bit create_tid, 64 bit delete_tid), plus a 64-bit storage offset and :> other auxillary info. That's instead of using a radix-compressed key. :> Radix compressed keys would have doubled the complexity of the B-Tree :> code, particularly with the middle-of-tree pointer caching that :> HAMMER does. : :I use two-stage lookup, or four stage if you count searches within :btree blocks. This makes the search depth smaller in the case of small :files, and in the case of really huge files it adds depth exactly as :appropriate. The index blocks end up cached in the page cache (the :buffer cache is just a flavor of page cache in recent Linux) so there :is little repeated descending through the btree indices. Instead, most :of the probing is through the scarily fast radix tree code, which has :been lovingly optimized over the years to avoid cache line misses and :SMP lock contention. : :I also received a proposal for a "fat" btree index from a collaborator :(Maciej) that included the file offset but I really did not like the... :fattening. A two level btree yields less metadata overall which in my :estimation is more important than saving some bree probes. The way :things work these days, falling down from level 1 cache to level 2 or :from level 2 to level 3 costs much more than executing some extra CPU :instructions. So optimization strategy inexorably shifts away from :minimizing instructions towards minimizing cache misses. The depth and complexity of your master B-Tree will definitely be smaller. You are offloading both the file contents and snapshots. HAMMER incorporates both into a single global B-Tree. This has been somewhat of a mixed bag for HAMMER. On the plus side performance is very good for production operations (where the same filename is not being created over and over and over and over again and where the same file is not truncated over and over and over again).... locality of reference is extremely good in a single global B-Tree when accessing lots of small files or when accessing fewer larger files. On the negative side, performance drops noticeably if a lot of historical information (aka snapshots) has built up in the data set being accessed. Even though the lookups themselves are extremely efficient, the access footprint (the size of the data set the system must cache) of the B-Tree becomes larger, sometimes much larger. I think the cpu overhead is going to be a bit worse then you are contemplating, but your access footprint (with regards to system memory use caching the meta-data) for non-historical accesses will be better. Are you going to use a B-Tree for the per-file layer or a blockmap? And how are you going to optimize the storage for small files? Are you just going to leave them in the log and not push a B-Tree for them? :> The historical access mechanic alone added major complexity to the :> B-Tree algorithms. I will note here that B-Tree searches have a very :> high cpu overhead no matter how you twist it, and being able to cache :> cursors within the B-Tree is absolutely required if you want the :> filesystem to perform well. If you always start searches at the root :> your cpu overhead will be horrendous... so plan on caching cursors :> from the get-go. : :Fortunately, we get that for free on Linux, courtesy of the page cache :radix trees :-) : :I might eventually add some explicit cursor caching, but various :artists over the years have noticed that it does not make as much :difference as you might think. For uncacheable data sets the cpu overhead is almost irrelevant. But for cached data sets, watch out! The cpu overhead of your B-Tree lookups is going to be 50% of your cpu time, with the other 50% being the memory copy or memory mapping operation and buffer cache operations. It is really horrendous. When someone is read()ing or write()ing a large file the last thing you want to do is traverse the same 4 nodes and do four binary searches in each of those nodes for every read(). For large fully cached data sets not caching B-Tree cursors will strip away 20-30% of your performance once your B-Tree depth exceeds 4 or 5. Also, the fan-out does not necessarily help there because the search within the B-Tree node costs almost as much as moving between B-Tree nodes. I found this out when I started comparing HAMMER performance with UFS. For the fully cached case UFS was 30% faster until I started caching B-Tree cursors. It was definitely noticable once my B-Tree grew past a million elements or so. It disappeared completely when I started caching cursors into the B-Tree. :> If I were to do radix compression I would also want to go with a :> fully dynamic element size and fully dynamic fan-out in order to :> best-compress each B-Tree node. Definitely food for thought. : :Indeed. But Linux is braindamaged about large block size, so there is :very strong motivation to stay within physical page size for the :immediate future. Perhaps if I get around to a certain hack that has :been perenially delayed, that situation will improve: : : "Variable sized page objects" : http://lwn.net/Articles/37795/ I think it will be an issue for people trying to port HAMMER. I'm trying to think of ways to deal with it. Anyone doing an initial port can just drop all the blocks down to 16K, removing the problem but increasing the overhead when working with large files. :> I'd love to do something like that. I think radix compression would :> remove much of the topological bloat the B-Tree creates verses using :> blockmaps, generally speaking. : :Topological bloat? Bytes per record. e.g. the cost of creating a small file in HAMMER is 3 B-Tree records (directory entry + inode record + one data record), plus the inode data, plus the file data. For HAMMER that is 64*3 + 128 + 112 (say the file is 100 bytes long, round up to a 16 byte boundary)... so that is 432 bytes. The bigger cost is when creating and managing a large file. A 1 gigabyte file in HAMMER requires 1G/65536 = 16384 B-Tree elements, which comes to 1 megabyte of meta-data. If I were to radix-compress those elements the meta-data overhead would probably be cut to 300KB, possibly even less. Where this matters is that it directly effects the meta-data footprint in the system caches which in turn directly effects the filesystem's ability to cache information without having to go to disk. It can be a big deal. :Ah, that is a very nice benefit of Tux3-style forward logging I forgot :to mention in the original post: transaction size is limited only by :available free space on the volume, because log commit blocks can be :anywhere. Last night I dreamed up a log commit block variant where the :associated transaction data blocks can be physically discontiguous, to :make this assertion come true, shortly after reading Stephen Tweedie's :horror stories about what you have to do if you must work around not :being able to put an entire, consistent transaction onto stable media :in one go: Yes, that's an advantage, and one of the reasons why HAMMER is designed for large filesystems and not small ones. Without a forward-log HAMMER has to reserve more media space for sequencing its flushes. Adding a forward log to HAMMER is possible, I might do it just for quick write()/fsync() style operations. I am still very wary of the added code complexity. : http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html : :Most of the nasties he mentions just vanish if: : : a) File data is always committed atomically : b) All commit transactions are full transactions Yup, which you get for free with your forward-log. :He did remind me (via time travel from year 2000) of some details I :should write into the design explicitly, for example, logging orphan :inodes that are unlinked while open, so they can be deleted on replay :after a crash. Another nice application of forward logging, which :avoids the seek-happy linked list through the inode table that Ext3 :does. Orphan inodes in HAMMER will always be committed to disk with a 0 link count. The pruner deals with them after a crash. Orphan inodes can also be commited due to directory entry dependancies. :> On the flip side, the reblocker doesn't just de-fragment the filesystem, :> it is also intended to be used for expanding and contracting partitions, :> and adding removing volumes in the next release. Definitely a :> multi-purpose utility. : :Good point. I expect Tux3 will eventually have a reblocker (aka :defragger). There are some really nice things you can do, like: : : 1) Set a new version so logical changes cease for the parent : version. : : 2) We want to bud off a given directory tree into its own volume, : so start by deleting the subtree in the current version. If : any link counts in the directory tree remain nonzero in the : current version then there are hard links into the subtree, so : fail now and drop the new version. : : 3) Reblock a given region of the inode table tree and all the files : in it into one physically contiguous region of blocks : : 4) Add a free tree and other once-per volume structures to the new : home region. : : 5) The new region is now entirely self contained and even keeps its : version history. At the volume manager level, map it to a new, : sparse volume that just has a superblock in the zeroth extent and : the new, mini filesystem at some higher logical address. Remap : the region in the original volume to empty space and add the : empty space to the free tree. : : 6) Optionally reblock the newly budded filesystem to the base of the : new volume so utilties that do not not play well with sparse : volumes do not do silly things. Yes, I see. The budding operations is very interesting to me... well, the ability to fork the filesystem and effectively write to a snapshot. I'd love to be able to do that, it is a far superior mechanism to taking a snapshot and then performing a rollback later. Hmm. I could definitely manage more then one B-Tree if I wanted to. That might be the ticket for HAMMER... use the existing snapshot mechanic as backing store and a separate B-Tree to hold all changes made to the snapshot, then do a merged lookup between the new B-Tree and the old B-Tree. That would indeed work. :> So I'm not actually being cavalier about it, its just that I had to :> make some major decisions on the algorithm design and I decided to :> weight the design more towards performance and large-storage, and :> small-storage suffered a bit. : :Cavalier was a poor choice of words, the post was full of typos as well :so I am not proud of it. You are solving a somewhat different problem :and you have code out now, which is a huge achievement. Still I think :you can iteratively improve your design using some of the techniques I :have stumbled upon. There are probably some nice tricks I can get from :your code base too once I delve into it. Meh, you should see my documents when I post them without taking 10 editorial passes. Characters reversed, type-o's, sentences which make no sense, etc. :> In anycase, it sounds like Tux3 is using many similar ideas. I think :> you are on the right track. : :Thankyou very much. I think you are on the right track too, which you :have a rather concrete way of proving. : :.... :> rather then have to go through another 4 months of rewriting. All :> major designs are an exercise in making trade-offs in topology, feature :> set, algorithmic complexity, debuggability, robustness, etc. The list :> goes on forever. :> : :The big ahas! that eliminated much of the complexity in the Tux3 design :were: : : * Forward logging - just say no to incomplete transactions : * Version weaving - just say no to recursive copy on write : :Essentially I have been designing Tux3 for ten years now and working :seriously on the simplifying elements for the last three years or so, :either entirely on paper or in related work like ddsnap and LVM3. : :One of the nice things about moving on from design to implementation of :Tux3 is that I can now background the LVM3 design process and see what :Tux3 really wants from it. I am determined to match every checkbox :volume management feature of ZFS as efficiently or more efficiently, :without violating the traditional layering between filesystem and :block device, and without making LVM3 a Tux3-private invention. : :> Laters! I can tell you've been thinking about Tux for a long time. If I had one worry about your proposed implementation it would be in the area of algorithmic complexity. You have to deal with the in-memory cache, the log, the B-Tree, plus secondary indexing for snapshotted elements and a ton of special cases all over the place. Your general lookup code is going to be very, very complex. My original design for HAMMER was a lot more complex (if you can believe it!) then the end result. A good chunk of what I had to do going from concept to reality was deflate a lot of that complexity. When it got down to brass tacks I couldn't stick with using the delete_tid as a B-Tree search key field, I had to use create_tid. I couldn't use my fine-grained storage model concept because the performance was terrible (too many random writes interfering with streaming I/O). I couldn't use a hybrid B-Tree, where a B-Tree element could hold the base of an entirely new B-Tree (it complicated the pruning and reblocking code so much that I just gave up trying to do it in frustration). I couldn't implement file extents other then 16K and 64K blocks (it really complicated historical lookups and the buffer cache couldn't handle it) <--- that one really annoyed me. I had overly optimized the storage model to try to get block pointers down to 32 bits by localizing B-Tree elements in the same 'super clusters' as the data they referenced. It was a disaster and I ripped it out. The list goes on :-) I do wish we had something like LVM on BSD systems. You guys are very lucky in that regard. LVM is really nice. BTW it took all day to write this! -Matt Matthew Dillon <dillon@backplane.com> :Hopefully not too much later. I find this dialog very fruitful. I just :wish such dialog would occur more often at the design/development stage :in Linux and other open source work instead of each group obsessively :ignoring all "competing" designs and putting huge energy into chatting :away about the numerous bugs that arise from rushing their design or :ignoring the teachings of history. : :Regards, : :Daniel _______________________________________________ Tux3 mailing list Tux3@tux3.org http://tux3.org/cgi-bin/mailman/listinfo/tux3
From: Daniel Phillips Subject: Re: [Tux3] Comparison to Hammer fs design Date: Jul 27, 4:51 am 2008 linSubscribed now, everything should be OK. On Friday 25 July 2008 19:02, Matthew Dillon wrote: > :Yes, that is the main difference indeed, essentially "log everything" vs > :"commit" style versioning. The main similarity is the lifespan oriented > :version control at the btree leaves. > > Reading this and a little more that you describe later let me make > sure I understand the forward-logging methodology you are using. > You would have multiple individually-tracked transactions in > progress due to parallelism in operations initiated by userland and each > would be considered committed when the forward-log logs the completion > of that particular operation? Yes. Writes tend to be highly parallel in Linux because they are mainly driven by the VMM attempting to clean cache dirtied by active writers, who generally do not wait for syncing. So this will work really well for buffered IO, which is most of what goes on in Linux. I have not thought much about how well this works for O_SYNC or O_DIRECT from a single process. I might have to do it slightly differently to avoid performance artifacts there, for example, guess where the next few direct writes are going to land based on where the most recent ones did and commit a block that says "the next few commit blocks will be found here, and here, and here...". When a forward commit block is actually written it contains a sequence number and a hash of its transaction in order to know whether the commit block write ever completed. This introduces a risk that data overwritten by the commit block might contain the same hash and same sequence number in the same position, causing corruption on replay. The chance of this happening is inversely related to the size of the hash times the chance of colliding with the same sequence number in random data times the chance of of rebooting randomly. So the risk can be set arbitrarily small by selecting the size of the hash, and using a good hash. (Incidentally, TEA was not very good when I tested it in the course of developing dx_hack_hash for HTree.) Note: I am well aware that a debate will ensue about whether there is any such thing as "acceptable risk" in relying on a hash to know if a commit has completed. This occurred in the case of Graydon Hoare's Monotone version control system and continues to this day, but the fact is, the cool modern version control systems such as Git and Mercurial now rely very successfully on such hashes. Nonetheless, the debate will keep going, possibly as FUD from parties who just plain want to use some other filesystem for their own reasons. To quell that definitively I need a mount option that avoids all such commit risk, perhaps by providing modest sized journal areas salted throughout the volume whose sole purpose is to record log commit blocks, which then are not forward. Only slightly less efficient than forward logging and better than journalling, which has to seek far away to the journal and has to provide journal space for the biggest possible journal transaction as opposed to the most commit blocks needed for the largest possible VFS transaction (probably one). > If the forward log entries are not (all) cached in-memory that would mean > that accesses to the filesystem would have to be run against the log > first (scanning backwards), and then through to the B-Tree? You > would solve the need for having an atomic commit ('flush groups' in > HAMMER), but it sounds like the algorithmic complexity would be > very high for accessing the log. Actually, the btree node images are kept fully up to date in the page cache which is the only way the high level filesystem code accesses them. They do not reflect exactly what is on the disk, but they do reflect exactly what would be on disk if all the logs were fully rolled up ("replay"). The only operations on forward logs are:  1) Write them  2) Roll them up into the target objects 3) Wait on rollup completion The rollup operation is also used for replay after a crash. A forward log that carries the edits to some dirty cache block pins that dirty block in memory and must be rolled up into a physical log before the cache block can be flushed to disk. Fortunately, such a rollup requires only a predictable amount of memory: space to load enough of the free tree to allocate space for the rollup log, enough available cache blocks to probe the btrees involved, and a few cache blocks to set up the physical log transaction. It is the responsibility of the transaction manager to ensure that sufficient memory to complete the transaction is available before initiating it, otherwise deadlock may occur in the block writeout path. (This requirement is the same as for any other transaction scheme). One traditional nasty case that becomes really nice with logical forward logging is truncate of a gigantic file. We just need to commit a logical update like ['resize', inum, 0] then the inode data truncate can proceed as convenient. Another is orphan inode handling where an open file has been completely unlinked, in which case we log the logical change ['free', inum] then proceed with the actual delete when the file is closed or when the log is replayed after a surprise reboot. Logical log replay is not idempotent, so special care has to be taken on replay to ensure that specified changes have not already been applied to the target object. I choose not to go the traditional route of providing special case tests for "already applied" which get really ugly or unreliable when there are lot of stacked changes. Instead I introduce the rule that a logical change can only be applied to a known good version of the target object, which promise is fullfilled via the physical logging layer. For example, on replay, if we have some btree index on disk for which some logical changes are outstanding then first we find the most recent physically logged version of the index block and read it into cache, then apply the logical changes to it there. Where interdependencies exist between updates, for example the free tree should be updated to reflect a block freed by merging two btree nodes, the entire collection of logical and physical changes has to be replayed in topologically sorted order, the details of which I have not thought much about other than to notice it is always possible. When replay is completed, we have a number of dirty cache blocks which are identical to the unflushed cache blocks at the time of a crash, and we have not yet flushed any of those to disk. (I suppose this gets interesting and probably does need some paranoid flushing logic in replay to handle the bizarre case where a user replays on a smaller memory configuration than they crashed on.) The thing is, replay returns the filesystem to the logical state it was in when the crash happened. This is a detail that journalling filesystem authors tend to overlook: actually flushing out the result of the replay is pointless and only obscures the essential logic. Think about a crash during replay, what good has the flush done? > And even though you wouldn't have to group transactions into larger > commits the crash recovery code would still have to implement those > algorithms to resolve directory and file visibility issues. The problem > with namespace visibility is that it is possible to create a virtually > unending chain of separate but inter-dependant transactions which either > all must go, or none of them. e.g. creating a, a/b, a/b/c, a/b/x, a/b/c/d, > etc etc. I do not see why this example cannot be logically logged in pieces: ['new', inum_a, mode etc] ['link', inum_parent, inum_a, "a"] ['new', inum_b, mode etc] ['link' inum_a, inum_b, "b"] ['new', inum_c, mode etc] ['link' inum_a_b, inum_c, "c"] ['new', inum_x, mode etc] ['link' inum_a_b, inum_x, "x"] ['new', inum_d, mode etc] ['link' inum_a_b_c, inum_d, "d"] Logical updates on one line are in the same logical commit. Logical allocations of blocks to record the possibly split btree leaves and new allocations omitted for clarity. The omitted logical updates are bounded by the depth of the btrees. To keep things simple, the logical log format should be such that it is impossible to overflow one commit block with the updates required to represent a single vfs level transaction. I suspect there may be some terminology skew re the term "transaction". Tux3 understands this as "VFS transaction", which does not include trying to make an entire write (2) call atomic for example, but only such things as allocate+write for a single page cache page or allocate+link for a sys_link call. Fsync(2) is not a transaction but a barrier that Tux3 is free to realize via multiple VFS transactions, which the Linux VFS now takes care of pretty well now after many years of patching up the logic. > At some point you have to be able to commit so the whole mess > does not get undone by a crash, and many completed mini-transactions > (file or directory creates) actually cannot be considered complete until > their governing parent directories (when creating) or children (when > deleting) have been committed. The problem can become very complex. Indeed. I think I have identified a number of techniques for stripping away much of that complexity. For example, the governing parent update can be considered complete as soon as the logical 'link' log commit has completed. > Last question here: Your are forward-logging high level operations. > You are also going to have to log meta-data (actual B-Tree manipulation) > commits in order to recover from a crash while making B-Tree > modifications. Correct? Correct! That is why there are two levels of logging: logical and physical. The physical logging level takes care of updating the cache images of disk blocks to match what the logical logging level expects before it can apply its changes. These two logging levels are interleaved: where a logical change requires splitting a btree block, the resulting blocks are logged logically, but linked into the parent btree block using a logical update stored in the commit block of the physical log transaction. How cool is that? The physical log transactions are not just throwaway things, they are the actual new data. Only the commit block is discarded, which I suppose will leave a lot of one block holes around the volume, but then I do not have to require that the commit block be immediately adjacent to the body of the transaction, which will allow me to get good value out of such holes. On modern rotating media, strictly linear transfers are not that much more efficient than several discontiguous transfers that all land relatively close to each other. > So your crash recovery code will have to handle > both meta-data undo and completed and partially completed transactions. Tux3 does not use undo logging, only redo, so a transaction is complete as soon as there is enough information on durable media to replay the redo records. > And there will have to be a tie-in between the meta-data commits and > the transactions so you know which ones have to be replayed. That > sounds fairly hairy. Have you figured out how you are doing to do that? Yes. Each new commit records the sequence number of the oldest commit that should be replayed. So the train lays down track in front of itself and pulls it up again when the caboose passes. For now, there is just one linear sequence of commits, though I could see elaborating that to support efficient clustering. One messy detail: each forward log transaction is written into free space wherever physically convenient, but we need to be sure that that free space is not allocated for data until log rollup has proceeded past that transaction. One way to do this is to make a special check against the list of log transactions in flight at the point where extent allocation thinks it has discovered a suitable free block, which is the way ddsnap currently implements the idea. I am not sure whether I am going to stick with that method for Tux3 or just update the disk image of the free tree to include the log transaction blocks and somehow avoid logging those particular free tree changes to disk. Hmm, a choice of two ugly but workable methods, but thankfully neither affects the disk image. > :Once I get down to the leaf level I binary search on logical address in > :the case of a file index btree or on version in the case of an inode > :table block, so this cost is still Log(N) with a small k. For a > :heavily versioned inode or file region this might sometimes result in > :an overflow block or two that has to be linearly searched which is not > :a big problem so long as it is a rare case, which it really ought to be > :for the kinds of filesystem loads I have seen. A common example of a > :worst case is /var/log/messages, where the mtime and size are going to > :change in pretty much every version, so if you have hourly snapshots > :and hold them for three months it adds up to about 2200 16 byte inode > :table attributes to record, about 8 inode table leaf blocks. I really > :do not see that as a problem. If it goes up to 100 leaf blocks to > :search then that could be a problem. > > I think you can get away with this as long as you don't have too many > snapshots, and even if you do I noticed with HAMMER that only a small > percentage of inodes have a large number of versions associated with > them from normal production operation. Yes, that is my expectation. I think everything will perform fine up to a few hundred snapshots without special optimization, which is way beyond current expectations. Most users only recently upgraded to journalling filesystems let alone having any exposure to snapshots at all. > /var/log/messages > is an excellent example of that. Log files were effected the most though > I also noticed that very large files also wind up with multiple versions > of the inode, such as when writing out a terrabyte-sized file. Right, when writing the file takes longer than the snapshot interval. > Even with a direct bypass for data blocks (but not their meta-data, > clearly), HAMMER could only cache so much meta-data in memory before > it had to finalize the topology and flush the inode out. A > terrabyte-sized file wound up with about 1000 copies of the inode > prior to pruning (one had to be written out about every gigabyte or so). For Tux3 with an hourly snapshot schedule that will only be four or five versions of the [mtime, size] attribute to reflect the 4.7 hours write time or so, and just one version of each block pointer. > :The penultimate inode table index block tells me how many blocks a > :given inode lives in because several blocks will have the same inum > :key. So the lookup algorithm for a massively versioned file becomes: > :1) read the first inode table block holding that inum; 2) read the last > :block with the same inum. The latter operation only needs to consult > :the immediate parent index block, which is locked in the page cache at > :that point. > > How are you dealing with expansion of the logical inode block(s) as > new versions are added? I'm assuming you are intending to pack the > inodes on the media so e.g. a 128-byte inode would only take up > 128 bytes of media space in the best case. Multiple inodes would be > laid out next to each other logically (I assume), but since the physical > blocks are larger they would also have to be laid out next to each > other physically within any given backing block. Now what happens > when one has to be expanded? The inode table block is split at a boundary between inodes. An "inode" is broken up into attribute groups arranged so that it makes sense to update or version all the members of an attribute group together.  The "standard" attribute group consists of mode, uid, gid, ctime and version, which adds up to 16 bytes.  The smallest empty inode (touch foo) is 40 bytes and a file with "foo" in it as immediate data is 64 bytes.  This has a standard attribute, a link count attribute, a size/mtime attribute, and an immediate data attribute, all versioned. An inode with heavily versioned attributes might overflow into the next btree leaf as described elsewhere. Free inodes lying between active inodes in the same leaf use two bytes each, a consequence of the inode leaf directory, which has a table at the top of the leaf of two byte pointers giving the offset of each inode in the leaf, stored backwards and growing down towards the inode attributes. (This is a slight evolution of the ddsnap leaf format.) > I'm sure this ties into the forward-log but even with the best algorithms > you are going to hit limited cases where you have to expand the inode. > Are you just copying the inode block(s) into a new physical allocation > then? Split the inode in memory, allocating a new buffer; forward log the two new pieces physically out to disk with a logical record in the commit block recording where the two new pointers are to be inserted without actually inserting them until a logical rollup episode is triggered. > :> I couldn't get away from having a delete_tid (the 'death version > :> numbers'). I really tried :-) There are many cases where one is only > :> deleting, rather then overwriting. > : > :By far the most common case I would think. But check out the versioned > :pointer algorithms. Surprisingly that just works, which is not exactly > :obvious: > : > : Versioned pointers: a new method of representing snapshots > : http://lwn.net/Articles/288896/ > > Yes, it makes sense. If the snapshot is explicitly taken then you > can store direct references and chain, and you wouldn't need a delete > id in that case. From that article though the chain looks fairly > linear. Historical access could wind up being rather costly. I can think of three common cases of files that get a lot of historical modifications: * Append log - The first iteration in each new version generates a new size attribute * Truncate/rewrite - The first iteration in each new version generates a new size attribute and either a new file index root or a new immediate data attribute * Database - The first file change in each new version generates a new versioned pointer at the corresponding logical address in the file btree and a new size attribute (just because mtime is bundled together with the size in one attribute group). So the only real proliferation is the size/mtime attributes, which gets back to what I was thinking about providing quick access for the "current" version (whatever that means). > :I was originally planning to keep all versions of a truncate/rewrite > :file in the same file index, but recently I realized that that is dumb > :because there will never be any file data shared by a successor version > :in that case. So the thing to do is just create an entirely new > :versioned file data attribute for each rewrite, bulking up the inode > :table entry a little but greatly constraining the search for versions > :to delete and reducing cache pressure by not loading unrelated version > :data when traversing a file. > > When truncating to 0 I would agree with your assessment. For the > topology you are using you would definitely want to use different > file data sets. You also have to deal with truncations that are not > to 0, that might be to the middle of a file. Certainly not a > common case, but still a case that has to be coded for. If you treat > truncation to 0 as a special case you will be adding considerable > complexity to the algorithm. Yes, the proposal is to treat truncation to exactly zero as a special case. It is a tiny amount of extra code for what is arguably the most common file rewrite case. > With HAMMER I chose to keep everything in one B-Tree, whether historical > or current. That way one set of algorithms handles both cases and code > complexity is greatly reduced. It isn't optimal... large amounts of > built up history still slow things down (though in a bounded fashion). > In that regard creating a separate topology for snapshots is a good > idea. Essentially I chose the same strategy, except that I have the file trees descending from the inode table instead of stretching out to the side. I think this gives a more compact tree overall, and since I am using just one generic set of btree operations to handle these two variant btrees, additional code complexity is minimal. > :> Both numbers are then needed to > :> be able to properly present a historical view of the filesystem. > :> For example, when one is deleting a directory entry a snapshot after > :> that deletion must show the directory entry gone. > : > :...which happens in Tux3 courtesy of the fact that the entire block > :containing the dirent will have been versioned, with the new version > :showing the entry gone. Here is one of two places where I violate my > :vow to avoid copying an entire block when only one data item in it > :changes (the other being the atime table). I rely on two things to > :make this nice: 1) Most dirent changes will be logically logged and > :only rolled up into the versioned file blocks when there are enough to > :be reasonably sure that each changed directory block will be hit > :numerous times in each rollup episode. (Storing the directory blocks > :in dirent-create order as in PHTree makes this very likely for mass > :deletes.) 2) When we care about this is usually during a mass delete, > :where most or all dirents in each directory file block are removed > :before moving on to the next block. > > This could wind up being a sticky issue for your implementation. > I like the concept of using the forward-log but if you actually have > to do a version copy of the directory at all you will have to update the > link (or some sort of) count for all the related inodes to keep track > of inode visibility, and to determine when the inode can be freed and > its storage space recovered. There is a versioned link count attribute in the inode. When a link attribute goes to zero for a particular version it is removed from the inode. When there are no more link attributes left, that means no directory block references the inode for any version and the inode may be reused. Note: one slightly bizarre property of this scheme is that an inode can be reused in any version in which its link count is zero, and the data blocks referenced by different versions in the same file index can be completely unrelated. I doubt there is a use for that. > Directories in HAMMER are just B-Tree elements. One element per > directory-entry. There are no directory blocks. You may want to > consider using a similar mechanism. For one thing, it makes lookups > utterly trivial... the file name is hashed and a lookup is performed > based on the hash key, then B-Tree elements with the same hash key > are iterated until a match is found (usually the first element is the > match). Also, when adding or removing directory entries only the > directory inode's mtime field needs to be updated. It's size does not. Ext2 dirents are 8 bytes + name + round up to 4 bytes, very tough to beat that compactness. We have learned through bitter experience that anything other than an Ext2/UFS style physically stable block of dirents makes it difficult to support NFS telldir cookies accurately because NFS vs gives us only a 31 bit cookie to work with, and that is not enough to store a cursor for, say, a hash order directory traversal. This is the main reason that I have decided to go back to basics for the Tux3 directory format, PHTree, and make it physically stable. In the PHTree directory format lookups are also trivial: the directory btree is keyed by a hash of the name, then each dirent block (typically one) that has a name with that hash is searched linearly. Dirent block pointer/hash pairs are at the btree leaves. A one million entry directory has about 5,000 dirent blocks referenced by about 1000 btree leaf blocks, in turn referenced by three btree index blocks (branching factor of 511 and 75% fullness). These blocks all tend to end up in the page cache for the directory file, so searching seldom references the file index btree. I did some back of the envelope calculations of the number of cache lines that have to be hit for a lookup by Hammer with its fat btree keys and lower branching factor, vs Tux3 with its high branching factor, small keys, small pointers, extra btree level for dirent offset stability and stupidly wasteful linear dirent search. I will not bore the list with the details, but it is obvious that Tux3/PHTree will pass Hammer in lookup speed for some size of directory because of the higher btree fanout. PHTree starts from way behind courtesy of the 32 cache lines that have to be hit on average for the linear search, amounting to more than half the CPU cost of performing a lookup in a million element directory, so the crossover point is somewhere up in the millions of entries. Thus annoyed, I cast about for a better dirent leaf format than the traditional Ext2 directory block, and found one after not too much head scratching. I reuse the same leaf directory format as for inode leaf blocks, but this time the internal offsets are sorted by lexical name order instead of inode number order. The dirents become Ext2 dirents minus the record number format, and minus the padding to four byte alignment, which does not do anything useful. Dirent inum increases to 6 bytes, balanced by saving 1.5 pad bytes on average, so the resulting structure stores about 2% fewer dirents than the Ext2 format against a probable 50% reduction in CPU latency per lookup. A fine trade indeed. The new format is physically stable and suitable for binary searching. When we need to manage space within the leaf for creating or deleting an entry, a version of the leaf directory ordered by offset can be created rapidly from the lexically sorted directory, which occupies at most about six cache lines. The resulting small structure can be cached to take advantage of the fact that most mass creates and deletes keep hitting the same dirent block repeatedly, because Tux3 hands the dirents to getdents in physical dirent order. I concluded that both Hammer and PHTree are exceptionally fast at name resolution. When I have the new dirent block format in place I expect to really hammer Hammer. But then I do not expect Hammer to stand still either ;-) > Your current directory block method could also represent a problem for > your directory inodes... adding and removing directory entries causing > size expansion or contraction could require rolling new versions > of the directory inode to update the size field. You can't hold too > much in the forward-log without some seriously indexing. Another > case to consider along with terrabyte-sized files and log files. A PHTree directory grows like an append-only file, a block at a time, though every time a new entry is added, mtime has to change, so mtime changes more often than size. They are grouped together in the same attribute, so the distinction is moot. If the directory is modified at least once after every snapshot (quite likely) then the snapshot retention policy governs how many mtime/size attributes are stored in the inode. Say the limit is 1,024 snapshots, then 16K worth of those attributes have to be stored, which once again encourages me to store the "current" mtime specially where it can be retrieved quickly. It would also be a good idea to store the file data attribute in front of the mtime/size attributes so only stat has to worry about the bulky version info, not lookup. Note that there is no such thing as revving an entire inode, only bits and pieces of it. For truly massive number of versions, the plan is to split up the version tree into subtrees, each having a thousand versions or so. For each version subtree (hmm, subversion tree...) there is a separate inode btree carrying only attributes and file data owned by members of that subtree. At most log(subtrees) of those tables need to be accessed by any versioned entity algorithm. (Note the terminology shift from versioned pointers to versioned entities to reflect the fact that the same algorithms work for both pointers and attributes.) I think this approach scales roughly to infinity, or at least to the point where log(versions) goes vertical which is essentially never. It requires a bounded amount of implementation effort, which will probably be deferred for months or years. Incidentally, PHTree directories are pretty much expand-only, which is required in order to maintain physical dirent stability. Compaction is not hard, but it has to be an admin-triggered operation so it will not collide with traversing the directory. > :> Both numbers are > :> also needed to be able to properly prune out unwanted historical data > :> from the filesystem. HAMMER's pruning algorithm (cleaning out old > :> historical data which is no longer desired) creates holes in the > :> sequence so once you start pruning out unwanted historical data > :> the delete_tid of a prior record will not match the create_tid of the > :> following one (historically speaking). > : > :Again, check out the versioned pointer algorithms. You can tell what > :can be pruned just by consulting the version tree and the create_tids > :(birth versions) for a particular logical address. Maybe the hang is > :that you do not organize the btrees by logical address (or inum in the > :case of the inode table tree). I thought you did but have not read > :closely enough to be sure. > > Yes, I see. Can you explain the versioned pointer algorithm a bit more, > it looks almost like a linear chain (say when someone is doing a daily > snapshot). It looks great for optimal access to the HEAD but it doesn't > look very optimal if you want to dive into an old snapshot. It is fine for diving into old snapshots. Lookup is always O(elements) where an element is a versioned pointer or (later) extent, which are very compact at 8 bytes each. Because the truncate/rewrite file update case is so common, you will rarely see more than one version at any given logical address in a file. A directory growing very slowly could end up with about 200 different versions of each of its final few blocks, one for each added entry. It takes on the order of a microsecond to scan through a few hundred versioned pointers to find the one that points at the version in which we are interested, and that only happens the first time the directory block is read into the page cache. After that, each reference to the block costs a couple of lookups in a page cache radix tree. The version lookup algorithm is not completely obvious: we scan through the list of pointers looking for the version label that is nearest the version being accessed on the path from that version to the root of the version tree. This potentially expensive search is accelerated using a bitmap table to know when a version is "on the path" and a pre-computed ord value for each version, to know which of the versions present and "on the path" for a given logical address is furthest from the root version. This notion of furthest from the root on the path from a given version to the root implements data inheritance, which is what yields the compact representation of versioned data, and is also what eliminates the need to explicitly represent the death version. You have discovered this too, in a simpler form. I believe that once you get the aha you will want to add versions of versions to your model. > For informational purposes: HAMMER has one B-Tree, organized using > a strict key comparison. The key is made up of several fields which > are compared in priority order: > > localization - used to localize certain data types together and > to separate pseudo filesystems created within > the filesystem. > obj_id - the object id the record is associated with. > Basically the inode number the record is > associated with. > > rec_type - the type of record, e.g. INODE, DATA, SYMLINK-DATA, > DIRECTORY-ENTRY, etc. > > key - e.g. file offset > > create_tid - the creation transaction id > > Inodes are grouped together using the localization field so if you > have a million inodes and are just stat()ing files, the stat > information is localized relative to other inodes and doesn't have to > skip file contents or data, resulting in highly localized accesses > on the storage media. > > Beyond that the B-Tree is organized by inode number and file offset. > In the case of a directory inode, the 'offset' is the hash key, so > directory entries are organized by hash key (part of the hash key is > an iterator to deal with namespace collisions). > > The structure seems to work well for both large and small files, for > ls -lR (stat()ing everything in sight) style traversals as well as > tar-like traversals where the file contents for each file is read. That is really sweet, provided that you are ok with the big keys. It was smart to go with that regular structure, giving you tons of control over everything and get the system up and running early, thus beating the competition. I think you can iterate from there to compress things and be even faster. Though if I were to presume to choose your priorities for you, it would be to make the reblocking optional. > The create_tid is the creation transaction id. Historical accesses are > always 'ASOF' a particular TID, and will access the highest create_tid > that is still <= the ASOF TID. The 'current' version of the filesystem > uses an asof TID of 0xFFFFFFFFFFFFFFFF and hence accesses the highest > create_tid. So you have to search through a range of TIDs to find the operative one for a particular version, just as I have to do with versioned pointers. Now just throw in a bitmap like I use to take the version inheritance into account and you have versioned pointers, which means snapshots of snapshots and other nice things. You're welcome :-) I think that versioned pointer lookup can be implemented in O(log(E)) where E is the number of versioned pointers (entities) at the same logical address, which would put it on a par with btree indexing, but more compact. I have sketched out an algorithm for this but I have not yet tried to implement it. > There is also a delete_tid which is used to filter out elements that > were deleted prior to the ASOF TID. There is currently an issue with > HAMMER where the fact that the delete_tid is *not* part of the B-Tree > compare can lead to iterations for strictly deleted elements, verses > replaced elements which will have a new element with a higher create_tid > that the B-Tree can seek to directly. In fact, at one point I tried > indexing on delete_tid instead of create_tid, but created one hell of a > mess in that I had to physically *move* B-Tree elements being deleted > instead of simply marking them as deleted. This is where I do not quite follow you. File contents are never really deleted in subsequent versions, they are just replaced. To truncate a file, just add or update size attribute for the target version and recover any data blocks past the truncate point that belongs only to the target version. Extended attributes and dirents can be truly deleted. To delete an extended attribute that is shared by some other version I will insert a new versioned attribute that says "this attribute is not here" for the current version, which will be inherited by its child versions. Dirent deletion is handled by updating the dirent block, possibly versioning the entire block. Should the entire block ever be deleted by a directory repacking operation (never actually happens on the vast majority of Linux systems) that will be handled as a file truncate. I see that in Hammer, the name is deleted from the btree instead, so fair enough, that is actual deletion of an object. But it could be handled alternatively by adding a "this object is not here" object, which might be enough to get rid of delete_tid entirely and just have create_tid. > :Fair enough. I have an entirely different approach to what you call > :mirroring and what I call delta replication. (I reserve the term > :mirroring to mean mindless duplication of physical media writes.) This > :method proved out well in ddsnap: > : > : http://phunq.net/ddtree?p=zumastor/.git;a=tree;h=fc5cb496fff10a2b03034fcf95122f5828149... > : > :(Sorry about the massive URL, you can blame Linus for that;-) > : > :What I do in ddsnap is compute all the blocks that differ between two > :versions and apply those to a remote volume already holding the first > :of the two versions, yielding a replica of the second version that is > :logically but not physically identical. The same idea works for a > :versioned filesystem: compute all the leaf data that differs between > :two versions, per inum, and apply the resulting delta to the > :corresponding inums in the remote replica. The main difference vs a > :ddsnap volume delta is that not all of the changes are physical blocks, > :there are also changed inode attributes, so the delta stream format > :has to be elaborated accordingly. > > Are you scanning the entire B-Tree to locate the differences? It > sounds you would have to as a fall-back, but that you could use the > forward-log to quickly locate differences if the first version is > fairly recent. I plan to scan the whole btree initially, which is what we do in ddsnap and it works fine. But by looking at the mtime attributes in the inode I can see in which versions the file data was changed and thus not have to scan most files. Eventually, building in some accelerator to be able to skip most inode leaves as well would be a nice thing to do. I will think about that in the background. > HAMMER's mirroring basically works like this: The master has synchronized > up to transaction id C, the slave has synchronized up to transaction id A. > The mirroring code does an optimized scan of the B-Tree to supply all > B-Tree elements that have been modified between transaction A and C. > Any number of mirroring targets can be synchronized to varying degrees, > and can be out of date by varying amounts (even years, frankly). > > I decided to use the B-Tree to optimize the scan. The B-Tree is > scanned and any element with either a creation or deletion transaction > id >= the slave's last synchronization point is then serialized and > piped to the slave. That is nearly identical to the ddsnap replication algorithm, and we also send the list over a pipe. Ddsnap does not deal with attributes, only volume blocks, otherwise I think we arrived at the same thing. > To avoid having to scan the entire B-Tree I perform an optimization > whereby the highest transaction id laid down at a leaf is propagated > up the B-Tree all the way to the root. This also occurs if a B-Tree > node is physically deleted (due to pruning), even if no elements are > actually present at the leaf within the transaction range. > Thus the mirroring scan is able to skip any internal node (and its > entire sub-tree) which has not been modified after the synchronization > point, and is able to identify any leaf for which real, physical deletions > have occured (in addition to logical deletions which simply set the > delete_tid field in the B-Tree element) and pass along the key range > and any remaining elements in that leaf for the target to do a > comparative scan with. Hmm. I have quite a few bits available in the inode table btree node pointers, which are currently only going to be free inode density hints. Something to think about. > This allows incremental mirroring, multiple slaves, and also allows > a mirror to go offline for months and then pop back online again > and optimally pick up where it left off. The incremental mirroring > is important, the last thing I want to do is have to scan 2 billion > B-Tree elements to do an incremental mirroring batch. Very, very nice. > The advantage is all the cool features I got by doing things that way. > The disadvantage is that the highest transaction id must be propagated > up the tree (though it isn't quite that bad because in HAMMER an entire > flush group uses the same transaction id, so we aren't constantly > repropagating new transaction id's up the same B-Tree nodes when > flushing a particular flush group). This is a job for forward logging :-) > You may want to consider something similar. I think using the > forward-log to optimize incremental mirroring operations is also fine > as long as you are willing to take the 'hit' of having to scan (though > not have to transfer) the entire B-Tree if the mirror is too far > out of date. There is a slight skew in your perception of the function of forward logging here, I think. Forward logs transactions are not long-lived disk objects, they just serve to batch together lots of little updates into a few full block "physical" updates that can be written to the disk in seek-optimized order. It still works well for this application. In short, we propagate dirty bits up the btree while updating the btree disk blocks only rarely. Instead, the node dirty bits are tucked into the commit block of the write transaction or namespace edit that caused the change, and are in turn propagated into the commit block of an upcoming physical rollup, eventually working their way up the on-disk btree image. But they are present in the cached btree image as soon as they are set, which is the structure consulted by the replication process. > :I intend to log insertions and deletions logically, which keeps each > :down to a few bytes until a btree rollup episode comes along to perform > :updating of the btree nodes in bulk. I am pretty sure this will work > :for you as well, and you might want to check out that forward logging > :trick. > > Yes. The reason I don't is because while it is really easy to lay down > a forward-log, intergrating it into lookup operations (if you don't > keep all the references cached in-memory) is really complex code. > I mean, you can always scan it backwards linearly and that certainly > is easy to do, but the filesystem's performance would be terrible. > So you have to index the log somehow to allow lookups on it in > reverse to occur reasonably optimally. Have you figured out how you > are going to do that? I hope that question is cleared up now. > With HAMMER I have an in-memory cache and the on-disk B-Tree. Just > writing the merged lookup code (merging the in-memory cache with the > on-disk B-Tree for the purposes of doing a lookup) was fairly complex. > I would hate to have to do a three-way merge.... in-memory cache, on-disk > log, AND on-disk B-Tree. Yowzer! Tux3 also has an in-memory cache and an on-disk btree, and in addition, the in-memory cache is identical to some future version of the on-disk btree, which ought to be good for losing a lot of corner cases. > :That reminds me, I was concerned about the idea of UNDO records vs > :REDO. I hope I have this right: you delay acknowledging any write > :transaction to the submitter until log commit has progressed beyond the > :associated UNDO records. Otherwise, if you acknowledge, crash, and > :prune all UNDO changes, then you would end up with applications > :believing that they had got things onto stable storage and be wrong > :about that. I have no doubt you did the right thing there, but it is > :not obvious from your design documentation. > > The way it works is that HAMMER's frontend is almost entirely > disconnected from the backend. All meta-data operations are cached > in-memory. create, delete, append, truncate, rename, write... > you name it. Nothing the frontend does modifies any meta-data > or mata-data buffers. > > The only sychronization point is fsync(), the filesystem syncer, and > of course if too much in-memory cache is built up. To improve > performance, raw data blocks are not included... space for raw data > writes is reserved by the frontend (without modifying the storage > allocation layer) and those data buffers are written to disk by the > frontend directly, just without any meta-data on-disk to reference > them so who cares if you crash then! It would be as if those blocks > were never allocated in the first place. Things are a little different in Linux. The VFS takes care of most of what you describe as your filesystem front end, and in fact, the VFS is capable of running as a filesystem entirely on its own, just by supplying a few stub methods to bypass the backing store (see ramfs). I think this is pretty much what you call your front end, though I probably missed some major functionality you provide there that the VFS does not directly provide. A filesystem backend on Linux would be the thing that implements the prepare_write/commit_write calls that come in from the one-size-fits all generic_file_buffered_write function that most filesystems use to implement write(2). These functions generally just call back into the block library, passing a filesystem-specific get_block callback which the library uses to assign physical disk sector addresses to buffers attached to the page. Ext3 opens a journal transaction in prepare_write and finishes it in commit_write. Not much of a transaction if you ask me. Tux3 wants to do things in a more direct way and in bigger chunks. Actually, all Tux3 needs to do with the prepare_write/commit_write calls that generic_file_buffered_write throws at it is remember which inodes where involved, because the inode keeps a list of dirty pages, the same ones that prepare_write/commit_write is trying to tell us about. When there are enough of them, on one of the commit_write calls Tux3 can go delving into its file index btree to find or create a place to store some. This will generate changes to the btree that must be logged logically in the commit block of the write transaction that is being set up, which now gets its sequence number so that the logical changes can be applied in the same order if replay is needed. At the point Tux3 is asked to generate some new logical changes, it may decide that it has already sent enough logical changes onto disk and now would be a good time to roll some of them up. The rolled up block images are sitting in the buffer cache where Tux3 has prudently locked them by incrementing the page use count. Tux3 now sets up some physical log transactions to write out the images. Before doing so, it copies each image to a new location in the buffer cache and modifies pointers in the parent images to point at the new images (possibly having to read in the parents first). This generates new logical changes (the changes to the parent nodes and changes to the free tree to allocate the new buffer cache positions) which will be logically logged in the commit blocks of the physical transaction about to be committed. When the rollup transactions have been sent to disk, Tux3 can continue allocating disk space etc by updating the new images of the disk blocks and setting up new transactions, but no transactions that depend on the rolled up state of the disk blocks can be allowed to complete until the rollup transactions have completed, which Tux3 learns of by counting the interrupt mode ->endio calls that come back from the bio transfers. Note: no other filesystem on Linux current works this way. They all pretty much rely on the block IO library which implements the fairly goofy one-block-at-a-time transfer regimen. The idea of setting up bio transfers directly from the filesystem is unfortunately something new. We should have been doing this for a long time, because the interface is quite elegant, and actually, that is just what the block IO library is doing many levels down below. It is time to strip away some levels and just do what needs to be done in the way it ought to be done. Notice how recursive the whole idea is. Logical transactions cause physical transactions which cause logical transactions etc. I am taking it on faith that the recursion terminates properly and does not get into cycles. Over time I will give concrete reasons why I think it all just works. Digression done, back to the Hammer algorithms... > When the backend decides to flush the cached meta-data ops it breaks > the meta-data ops into flush-groups whos dirty meta-data fits in the > system's buffer cache. The meta-data ops are executed, building the > UNDO, updating the B-Tree, allocating or finalizing the storage, and > modifying the meta-data buffers in the buffer cache. BUT the dirty > meta-data buffers are locked into memory and NOT yet flushed to > the media. The UNDO *is* flushed to the media, so the flush groups > can build a lot of UNDO up and flush it as they go if necessary. Hmm, and I swear I did not write the above before reading this paragraph. Many identical ideas there. > When the flush group has completed any remaining UNDO is flushed to the > media, we wait for I/O to complete, the volume header's UNDO FIFO index > is updated and written out, we wait for THAT I/O to complete, *then* > the dirty meta-data buffers are flushed to the media. The flushes at > the end are done asynchronously (unless fsync()ing) and can overlap with > the flushes done at the beginning of the next flush group. So there > are exactly two physical synchronization points for each flush. > > If a crash occurs at any point, upon remounting after a reboot > HAMMER only needs to run the UNDOs to undo any partially committed > meta-data. > > That's it. It is fairly straight forward. > > For your forward-log approach you would have to log the operations > as they occur, which is fine since that can be cached in-memory. > However, you will still need to synchronize the log entries with > a volume header update to update the log's index, so the recovery > code knows how far the log extends. Plus you would also have to log > UNDO records when making actual changes to the permanent media > structures (your B-Trees), which is independant of the forward-log > entries you made representing high level filesystem operations, and > would also have to lock the related meta-data in memory until the > related log entries can be synchronized. Then you would be able > to flush the meta-data buffers. That is pretty close to the mark, but there are UNDO records in Tux3, there are only logical and physical REDO records. The forward log is supposed to try to avoid syncing the beginning of the chain to a known location as much as possible, which it does by having at least one transaction in its pipeline that has not been committed to disk yet and is part of an existing forward chain. The chain is extended by allocating a location for the commit block of a new transaction being formed, storing that location in the commit block of the existing transaction, then committing the existing transaction to disk. > The forward-log approach is definitely more fine-grained, particularly > for fsync() operations... those would go much faster using a forward > log then the mechanism I use because only the forward-log would have > to be synchronized (not the meta-data changes) to 'commit' the work. > I like that, it would be a definite advantage for database operations. I think you will like it even more when you realize that updating the filesystem header is not required for most transactions. > :> HAMMER's B-Tree elements are probably huge compared to Tux3, and that's > :> another limiting factor for the fan-out I can use. My elements > :> are 64 bytes each. > : > :Yes, I mostly have 16 byte elements and am working on getting most of > :them down to 12 or 8. > > I don't know how you can make them that small. I spent months just > getting my elements down to 64 bytes. The data reference alone for > data blocks is 12 bytes (64 bit media storage offset and 32 bit length). A versioned extent: struct extent { unsigned version:10, count:6, block:24; }; Directory format to index the extent within a leaf: struct entry { unsigned loglo:24, offset:8 }; struct group { unsigned loghi:24, count:8 }; Leaf layout (abridged): struct leaf { unsigned groups; struct group[]; struct entry[]; struct elements[] }; Extents are divided into groups, all of which have the same upper 24 bits of logical address. The 8 bit offset gives the position of the first extent with that logical offset, measured in 8 byte units from the base of the extent group. The difference between two successive offset bytes is the number of extents with the same logical offset. The first offset byte gives the total size of the group, and would otherwise be always be zero and that would be a waste. The base of the ith group is determined by totalling the zeroth offset bytes. The 48 bit logical address is split between the 32 bit dictionary and the 32 bit group entry. This format gives the ability to binsearch on both entries and groups. The number of groups per leaf should be just one or two at deep leaves where considerable splitting of the logical address has already occurred, so the overhead of the dictionary is just 32 bits per extent roughly, giving a total overhead of 12 bytes per extent. Shallow btrees will approach 16 bytes per extent, but the tree is shallow anyway so it does not matter. The shallowest tree consists of just a few extents recorded as an attribute in the inode btree. I introduce the rule that for such an extent attribute, the extents belonging to each version are presumed to be logically contiguous, so no directory is needed and the extent overhead drops to 8 bytes in this common case. Actually, most files should just consist of a single extent. That could be a new attribute form, a size/extent attribute, with the mtime assumed to be the same as the ctime, which I think would cover a majority of files on the system I am using right now. See, I am really obsessive when it comes to saving bits. None of these compression hacks costs a lot of code or has a big chance of hiding a bug, because the extra boundary conditions are strictly local. > I think the cpu overhead is going to be a bit worse then you are > contemplating, but your access footprint (with regards to system memory > use caching the meta-data) for non-historical accesses will be better. We shall see. Factors working in my favor are: * Snapshots are explicit, so you have to create lots of them plus heavily update the data in order to make the versioned data bushy. * The linux page cache not only caches data but provides an efficient access path to logical blocks that avoids having to consult the filesystem metadata repeatedly under heavy traffic. * Compact metadata entities including versioned attributes and extents (pointers for now) * The kind of load that people put on a Netapp is a trivially small number of versions for Tux3. > Are you going to use a B-Tree for the per-file layer or a blockmap? Btree, which I call the file index btree. > And how are you going to optimize the storage for small files? Immediate data is allowed in an inode as a versioned attribute. > Are you just going to leave them in the log and not push a B-Tree > for them? It did not occur to me to leave them in the logical log, but yes that is a nice optimization. They will only stay there for a while, until a rollup is triggered, moving them into the inode table proper. It gets a little tricky to figure out whether the amortization due to the logical logging step is worth the extra write. There is a crossover somewhere, maybe around a hundred bytes or so, ideal for symlinks and acls. > :I might eventually add some explicit cursor caching, but various > :artists over the years have noticed that it does not make as much > :difference as you might think. > > For uncacheable data sets the cpu overhead is almost irrelevant. For rotating media, true. There is also flash, and Ramback... > But for cached data sets, watch out! The cpu overhead of your B-Tree > lookups is going to be 50% of your cpu time, with the other 50% being > the memory copy or memory mapping operation and buffer cache operations. > It is really horrendous. When someone is read()ing or write()ing a > large file the last thing you want to do is traverse the same 4 nodes > and do four binary searches in each of those nodes for every read(). I found that from my HTree experience. I found that pretty much the entire cost of the bad old linear dirops was CPU and I was able to show a ridiculous speedup by cutting the cost down to O(log511(ops)) from O(ops^2). > For large fully cached data sets not caching B-Tree cursors will > strip away 20-30% of your performance once your B-Tree depth exceeds > 4 or 5. Also, the fan-out does not necessarily help there because > the search within the B-Tree node costs almost as much as moving > between B-Tree nodes. > > I found this out when I started comparing HAMMER performance with > UFS. For the fully cached case UFS was 30% faster until I started > caching B-Tree cursors. It was definitely noticable once my B-Tree > grew past a million elements or so. It disappeared completely when > I started caching cursors into the B-Tree. OK, you convinced me. I already have btree cursors (see ddsnap path[]) that are so far only used for per operation btree access and editing. So these will eventually become cache objects. > :> If I were to do radix compression I would also want to go with a > :> fully dynamic element size and fully dynamic fan-out in order to > :> best-compress each B-Tree node. Definitely food for thought. > : > :Indeed. But Linux is braindamaged about large block size, so there is > :very strong motivation to stay within physical page size for the > :immediate future. Perhaps if I get around to a certain hack that has > :been perenially delayed, that situation will improve: > : > : "Variable sized page objects" > : http://lwn.net/Articles/37795/ > > I think it will be an issue for people trying to port HAMMER. I'm trying > to think of ways to deal with it. Anyone doing an initial port can > just drop all the blocks down to 16K, removing the problem but > increasing the overhead when working with large files. Do the variable sized page hack :-] Or alternatively look at the XFS buffer cache shim layer, which Christoph manhandled into kernel after years of XFS not being accepted into mainline because of it. > :> I'd love to do something like that. I think radix compression would > :> remove much of the topological bloat the B-Tree creates verses using > :> blockmaps, generally speaking. > : > :Topological bloat? > > Bytes per record. e.g. the cost of creating a small file in HAMMER > is 3 B-Tree records (directory entry + inode record + one data record), > plus the inode data, plus the file data. For HAMMER that is 64*3 + > 128 + 112 (say the file is 100 bytes long, round up to a 16 byte > boundary)... so that is 432 bytes. Let me see, for Tux3 that is 64 bytes for the inode part including a tiny amount of data, plus 32 bytes or so for the dirent part, say 100 bytes. This is my reward for being an obsessive bit mizer. > The bigger cost is when creating and managing a large file. A 1 gigabyte > file in HAMMER requires 1G/65536 = 16384 B-Tree elements, which comes > to 1 megabyte of meta-data. If I were to radix-compress those > elements the meta-data overhead would probably be cut to 300KB, > possibly even less. > > Where this matters is that it directly effects the meta-data footprint > in the system caches which in turn directly effects the filesystem's > ability to cache information without having to go to disk. It can > be a big deal. With pointers it is about a megabyte per gigabyte for Tux3 too (and Ext2 and Ext3) so going to extents is not optional. I plan a limit of 64 blocks per extent, cutting the metadata overhead down to 16K per gigabyte, which is hard to complain about. Larger extents in the file index would not buy much and would mess up the nice, 8 byte one-size-fits-all versioned extent design. > Adding a forward log to HAMMER is possible, I might do it just for > quick write()/fsync() style operations. I am still very wary of the > added code complexity. Waiting for me to prove the idea seems reasonable. > :He did remind me (via time travel from year 2000) of some details I > :should write into the design explicitly, for example, logging orphan > :inodes that are unlinked while open, so they can be deleted on replay > :after a crash. Another nice application of forward logging, which > :avoids the seek-happy linked list through the inode table that Ext3 > :does. > > Orphan inodes in HAMMER will always be committed to disk with a > 0 link count. The pruner deals with them after a crash. Orphan > inodes can also be commited due to directory entry dependancies. That is nice, but since I want to run without a pruner I had better stick with the logical logging plan. The link count of the inode will also be committed as zero, or rather, the link attribute will be entirely missing at that point, which allows for a cross check. > The budding operations is very interesting to me... > well, the ability to fork the filesystem and effectively write to > a snapshot. I'd love to be able to do that, it is a far superior > mechanism to taking a snapshot and then performing a rollback later. I was thinking more about the operation of taking somebody's home directory and turning it into a separate volume, for whatever reason. Tux3 snapshots (versions) have that write/rollback ability, and also have it for snapshots of snapshots, just in case that was not clear. > Hmm. I could definitely manage more then one B-Tree if I wanted to. > That might be the ticket for HAMMER... use the existing snapshot > mechanic as backing store and a separate B-Tree to hold all changes > made to the snapshot, then do a merged lookup between the new B-Tree > and the old B-Tree. That would indeed work. Yes, it sounds nice. > I can tell you've been thinking about Tux for a long time. If I > had one worry about your proposed implementation it would be in the > area of algorithmic complexity. You have to deal with the in-memory > cache, the log, the B-Tree, plus secondary indexing for snapshotted > elements and a ton of special cases all over the place. Your general > lookup code is going to be very, very complex. The lookup path is actually quite simple, I am far enough in to know that. The versioning algorithms are the most complex piece from an algorithmic point of view, and as you can see from the sample implementation, they ha
From: Matthew Dillon Subject: Re: [Tux3] Comparison to Hammer fs design Date: Jul 27, 2:31 pm 2008 :When a forward commit block is actually written it contains a sequence :number and a hash of its transaction in order to know whether the :... :Note: I am well aware that a debate will ensue about whether there is :any such thing as "acceptable risk" in relying on a hash to know if a :commit has completed. This occurred in the case of Graydon Hoare's :Monotone version control system and continues to this day, but the fact :is, the cool modern version control systems such as Git and Mercurial :now rely very successfully on such hashes. Nonetheless, the debate :will keep going, possibly as FUD from parties who just plain want to :use some other filesystem for their own reasons. To quell that :definitively I need a mount option that avoids all such commit risk, :perhaps by providing modest sized journal areas salted throughout the :volume whose sole purpose is to record log commit blocks, which then :are not forward. Only slightly less efficient than forward logging :and better than journalling, which has to seek far away to the journal :and has to provide journal space for the biggest possible journal :transaction as opposed to the most commit blocks needed for the largest :possible VFS transaction (probably one). Well, you've described both sides of the debate quite well. I for one am in the 0-risk camp. ReiserFS got into trouble depending on a collision space (albeit a small one). I did extensive testing of 64 bit collision spaces when I wrote Diablo (A USENET news system contemporary with INN, over 10 years ago) and even with a 64 bit space collisions could occur quite often simply due to the volume of article traffic. I think the cost can be reduced to the point where there's no need to allow any risk at all. After all, we're only talking about meta-data updates here for the most part. Even under heavy loads it can take 30 seconds for enough dirty meta-data to build up to require a flush. One can flush the mess, then seek/update the volume header. Or doing something like using a fixed LOG area, allowing it to be preformatted... that would remove 100% of the risk and not require a volume header update (drafting note: edited up from further below). I know volume headers seem old-school. It was a quick and dirty solution in HAMMER that I will be changing. :Actually, the btree node images are kept fully up to date in the page :cache which is the only way the high level filesystem code accesses :them. They do not reflect exactly what is on the disk, but they do :reflect exactly what would be on disk if all the logs were fully :rolled up ("replay"). Ok, this is simpler. Hmm. That brings up the question with regards to data allocations and the log. Are you going to use a fixed log space and allocate data-store separately or is the log a continuously running pointer across the disk? My understanding is it is a continuously running space. :A forward log that carries the edits to some dirty cache block pins :that dirty block in memory and must be rolled up into a physical log :before the cache block can be flushed to disk. Fortunately, such a :rollup requires only a predictable amount of memory: space to load :enough of the free tree to allocate space for the rollup log, enough :... Ok, here I spent about 30 minutes constructing a followup but then you answered some of the points later on, so what I am going to do is roll it up into a followup on one of your later points :-) :One traditional nasty case that becomes really nice with logical :forward logging is truncate of a gigantic file. We just need to :commit a logical update like ['resize', inum, 0] then the inode data :truncate can proceed as convenient. Another is orphan inode handling :where an open file has been completely unlinked, in which case we :log the logical change ['free', inum] then proceed with the actual :delete when the file is closed or when the log is replayed after a :surprise reboot. :... :Logical log replay is not idempotent, so special care has to be taken :on replay to ensure that specified changes have not already been Wait, it isn't? I thought it was. I think it has to be because the related physical B-Tree modifications required can be unbounded, and because physical B-Tree modifications are occuring in parallel the related physical operations cause the logical operations to become bound together, meaning the logical ops *cannot* be independantly backed out Here's an example: rm a/b/c <---- 1TB file rmdir a/b rmdir a There are three problems here. First, the logical log must be idempotent, or at least mostly idempotent, because the physical portions of the three separate logical transactions will complete at different times. Second, the logical log entry for "rm a/b/c" cannot be destroyed (due to log cycling based on available free space) until after the related physical operation has completed, which could occur seconds to minutes later. The second issue could become a big issue for you because it means the log space required will become unbounded or nearly unbounded. If the remove requires 10's to 100's of megabytes of physical log entries, all appending to the log, and the physical blocks backing the log cannot be recovered due to that 'rm a/b/c' needing to remain in the log until the physical operation is complete, then I think you have a problem. You would probably need to special-case it. A third issue effecting the amount of free space required to complete an operation is going to be the worst case physical log space required for a single low-level B-Tree transaction, which will roughly be related to the depth of the B-Tree, multiplied by the number of modifying B-Tree operations being done in parallel (e.g. from many individual processes). :... :ugly or unreliable when there are lot of stacked changes. Instead I :introduce the rule that a logical change can only be applied to a known :good version of the target object, which promise is fullfilled via the :physical logging layer. :... :then apply the logical changes to it there. Where interdependencies :exist between updates, for example the free tree should be updated to :reflect a block freed by merging two btree nodes, the entire collection :of logical and physical changes has to be replayed in topologically :sorted order, the details of which I have not thought much about other :than to notice it is always possible. I came up with a really simple solution to the freemap-reuse issue which you could apply to allow such replays to be idempotent. You simply do not allow a freed block to be reallocated until after all the information related to the logical operation that freed it has been solidly committed. Thus you can replay the low level frees without accidently re-freeing a block that has been reallocated. In HAMMER I delay reuse of freed blocks because UNDO's are only for meta-data, not file data, and thus freed data blocks cannot be reused until after the related freeing operations has been completely committed so crash recovery can do the rollback. You could use them to make otherwise non-idempotent operations idempotent. :When replay is completed, we have a number of dirty cache blocks which :are identical to the unflushed cache blocks at the time of a crash, :and we have not yet flushed any of those to disk. (I suppose this gets :interesting and probably does need some paranoid flushing logic in :replay to handle the bizarre case where a user replays on a smaller :memory configuration than they crashed on.) The thing is, replay :returns the filesystem to the logical state it was in when the crash :happened. This is a detail that journalling filesystem authors tend :to overlook: actually flushing out the result of the replay is :pointless and only obscures the essential logic. Think about a crash :during replay, what good has the flush done? Well, you have to flush at some point anyway because you only have so much kernel memory to play with.... You do not have enough kernel memory to cache the dirty buffers for all the physical activity related to a particular logical operation (such as truncating a 1TB file). :I do not see why this example cannot be logically logged in pieces: : : ['new', inum_a, mode etc] ['link', inum_parent, inum_a, "a"] : ['new', inum_b, mode etc] ['link' inum_a, inum_b, "b"] : ['new', inum_c, mode etc] ['link' inum_a_b, inum_c, "c"] : ['new', inum_x, mode etc] ['link' inum_a_b, inum_x, "x"] : ['new', inum_d, mode etc] ['link' inum_a_b_c, inum_d, "d"] : :Logical updates on one line are in the same logical commit. Logical :allocations of blocks to record the possibly split btree leaves and new :allocations omitted for clarity. The omitted logical updates are :bounded by the depth of the btrees. To keep things simple, the logical :log format should be such that it is impossible to overflow one commit :block with the updates required to represent a single vfs level :transaction. Hmm. Maybe. This helps you identify dependancies that must be replayed, but it might be easier just to make the entire logical log idempotent and just replay everything. I think the only gotcha there is handling link counts. That would still be far less complex then the alternative. :I suspect there may be some terminology skew re the term "transaction". :Tux3 understands this as "VFS transaction", which does not include :trying to make an entire write (2) call atomic for example, but only :such things as allocate+write for a single page cache page or :allocate+link for a sys_link call. Fsync(2) is not a transaction but :a barrier that Tux3 is free to realize via multiple VFS transactions, :which the Linux VFS now takes care of pretty well now after many years :of patching up the logic. I kinda mix them up. I think of them as transactions at different levels of granularity. :... :the resulting blocks are logged logically, but linked into the parent :btree block using a logical update stored in the commit block of the :physical log transaction. How cool is that? That is definitely cool. :transactions are not just throwaway things, they are the actual new :data. Only the commit block is discarded, which I suppose will leave :a lot of one block holes around the volume, but then I do not have to :require that the commit block be immediately adjacent to the body of :the transaction, which will allow me to get good value out of such :holes. On modern rotating media, strictly linear transfers are not :that much more efficient than several discontiguous transfers that all :land relatively close to each other. True enough. Those holes will create significant fragmentation once you cycle through available space on the media, though. :> So your crash recovery code will have to handle=20 :> both meta-data undo and completed and partially completed transaction= :s. : :Tux3 does not use undo logging, only redo, so a transaction is complete :as soon as there is enough information on durable media to replay the :redo records. Ok, that works. Be careful w/ your physical logging, it can get expensive. I think it may be more expensive then you are currently estimating. One of the reasons why HAMMER uses an UNDO log is that it is a lot less expensive then a REDO log. I'll show you why: Physical operations related to a single flush group in HAMMER: modify element in B-Tree node A (create UNDO w/ original contents of B-Tree node A) modify element in B-Tree node A (no new UNDO needs to be created) modify element in B-Tree node A (no new UNDO needs to be created) modify element in B-Tree node A (no new UNDO needs to be created) modify element in B-Tree node A (no new UNDO needs to be created) (NOTE: I do realize that the REDO log can be compressed just as the UNDO one, by recording actual data shifts from B-Tree insertions and deletions, it gets really complex when you do that, though, and would not be idempotent). See the advantage? Many B-Tree operations working on the same node, or revisiting that node later on, within the same flush group, do not have to lay down more then one UNDO record for that B-Tree node. The physical ops are modifying already modified space so I don't have to record each op in the UNDO log. Upon crash recovery the whole mess is undone by that first UNDO record. I just run the UNDOs backwards, which allows me to cache the in-flush undo areas with a fixed set of small throw-away tracking structures in system memory. If a structure gets thrown away it just means an extranious UNDO record will be appended to the UNDO FIFO. :Yes. Each new commit records the sequence number of the oldest commit :that should be replayed. So the train lays down track in front of :itself and pulls it up again when the caboose passes. For now, there :is just one linear sequence of commits, though I could see elaborating :that to support efficient clustering. Yah. I see. Noting that issue I brought up earlier about the "rm a/b/c". Locking the caboose of your log until all related physical operations have been completed could create a problem. :One messy detail: each forward log transaction is written into free :space wherever physically convenient, but we need to be sure that that :free space is not allocated for data until log rollup has proceeded :past that transaction. One way to do this is to make a special check :against the list of log transactions in flight at the point where :extent allocation thinks it has discovered a suitable free block, which :is the way ddsnap currently implements the idea. I am not sure whether :I am going to stick with that method for Tux3 or just update the disk :image of the free tree to include the log transaction blocks and :somehow avoid logging those particular free tree changes to disk. Hmm, :a choice of two ugly but workable methods, but thankfully neither :affects the disk image. You might be able to adapt the delayed-reuse concept I used in HAMMER to simplify the problem, possibly to the point of making those operations replayable. Another thing I did in HAMMER was reserve space on the frontend for data blocks without modifying the freemap at all. This allows the frontend to write file data blocks to the physical media without having to wait for synchronization with meta-data or the backend or anything. I keep track of the reserved space with small in-memory structures (whos function is to prevent other allocations from using the reserved space until it can be committed). Freemap updates are then committed in the same flush that commits the related B-Tree insertions and deletions. Thus the freemap is kept in sync. :... :> I also noticed that very large files also wind up with multiple versi= :ons :> of the inode, such as when writing out a terrabyte-sized file.=20 : :Right, when writing the file takes longer than the snapshot interval. Yah. It isn't as big a deal when explicit snapshots are taken. It is an issue for HAMMER because the history is more of a continuum. Effectively the system syncer gives us a 30-60 second granular snapshot without having to lift a finger (which incidently also makes having an 'undo' utility feasible). :So the only real proliferation is the size/mtime attributes, which gets :back to what I was thinking about providing quick access for the :"current" version (whatever that means). st_size is the one thing we can't get away from. For HAMMER, mtime updates don't roll new inodes and both mtime and atime are locked to the ctime when accessed via a snapshot (so one can use a (tar | md5) as a means of validating the integrity of the snapshot). :Note: one slightly bizarre property of this scheme is that an inode :can be reused in any version in which its link count is zero, and the :data blocks referenced by different versions in the same file index :can be completely unrelated. I doubt there is a use for that. It might complicate mirroring / clustering-style operations, but you may not be targetting that sort of thing. HAMMER is intended to become a cluster filesystem so I use 64 bit inode numbers which are never reused for the entire life of the filesystem. :Ext2 dirents are 8 bytes + name + round up to 4 bytes, very tough to :beat that compactness. We have learned through bitter experience that :anything other than an Ext2/UFS style physically stable block of :dirents makes it difficult to support NFS telldir cookies accurately :because NFS vs gives us only a 31 bit cookie to work with, and that is :not enough to store a cursor for, say, a hash order directory :traversal. This is the main reason that I have decided to go back to :basics for the Tux3 directory format, PHTree, and make it physically :stable. The cookies are 64 bits in DragonFly. I'm not sure why Linux would still be using 32 bit cookies, file offsets are 64 bits so you should be able to use 64 bit cookies. For NFS in DragonFly I use a 64 bit cookie where 32 bits is a hash key and 32 bits is an iterator to deal with hash collisions. Poof, problem solved. :In the PHTree directory format lookups are also trivial: the directory :btree is keyed by a hash of the name, then each dirent block (typically :one) that has a name with that hash is searched linearly. Dirent block :pointer/hash pairs are at the btree leaves. A one million entry :directory has about 5,000 dirent blocks referenced by about 1000 btree :leaf blocks, in turn referenced by three btree index blocks (branching :factor of 511 and 75% fullness). These blocks all tend to end up in :the page cache for the directory file, so searching seldom references :the file index btree. That seems reasonable. I'll look up the PHTree stuff, it sounds interesting. Personally speaking I don't mind eating 64 bytes per directory entry. I can accomodate file names up to 16 bytes in the B-Tree element itself so 95% of the namespace doesn't even require a look-aside data reference. The one mistake I made, which I didn't have time to fix in the first HAMMER release, is that my directory hash is a standard crc which means that records in the B-Tree aren't even close to being in any sort of lexical order. I will probably wind up changing the hash calculation a bit to get some lexical sorting information stuffed into it... should be doable since the hash keys are 64 bits. :will pass Hammer in lookup speed for some size of directory because of :the higher btree fanout. PHTree starts from way behind courtesy of the :32 cache lines that have to be hit on average for the linear search, :amounting to more than half the CPU cost of performing a lookup in a :million element directory, so the crossover point is somewhere up in :the millions of entries. Well, the B-Tree fanout isn't actually that big a deal. Remember HAMMER reblocks B-Tree nodes along with everything else. B-Tree nodes in a traversal are going to wind up in the same 16K filesystem block. That leaves just element space overhead, which frankly is not much of an issue on a production system because kernels have to cache a lot more then just the directory entry. The in-memory vnode and inode structures take up a lot more space then the cached directory entry. So even though HAMMER's directory entry is a 64-byte B-Tree element and roughly double the overhead of a UFS-style directory entry, the overall cache footprint in the system is only ~20% worse. :.. :>=20 :> Beyond that the B-Tree is organized by inode number and file offset. :> In the case of a directory inode, the 'offset' is the hash key, so :> directory entries are organized by hash key (part of the hash key is :> an iterator to deal with namespace collisions). :>=20 :> The structure seems to work well for both large and small files, for :> ls -lR (stat()ing everything in sight) style traversals as well as=20 :> tar-like traversals where the file contents for each file is read. : :That is really sweet, provided that you are ok with the big keys. It :was smart to go with that regular structure, giving you tons of control :over everything and get the system up and running early, thus beating the :competition. I think you can iterate from there to compress things and :be even faster. Though if I were to presume to choose your priorities :for you, it would be to make the reblocking optional. Yes, my thoughts exactly. Using a fairly large but extremely flexible B-Tree element chopped the code complexity down by a factor of 4 at least. If I hadn't done that HAMMER would have taken another year to finish. I plan on expanding efficiency mainly by making the data references fully extent-based, with each B-Tree element referencing an extent. In fact, read()s can already handle arbitrary extents but write clustering and historical access issues forced me to use only two block sizes for writes (16K and 64K). A HAMMER B-Tree element is theoretically capable of accessing up to 2G of linear storage. :... :> that the B-Tree can seek to directly. In fact, at one point I tried :> indexing on delete_tid instead of create_tid, but created one hell of= : a :> mess in that I had to physically *move* B-Tree elements being deleted :> instead of simply marking them as deleted. : :This is where I do not quite follow you. File contents are never :really deleted in subsequent versions, they are just replaced. To :truncate a file, just add or update size attribute for the target :version and recover any data blocks past the truncate point that :belongs only to the target version. Truncation is a deletion. That is, the B-Tree elements have to be marked as having been deleted so if you truncate-extend a file you see the new space as being full of zeros. Take a 1MB file. Truncate it to 512K, then truncate-extend the file back to 1MB. Or seek-write to extend the file... same difference, really. :Dirent deletion is handled by updating the dirent block, possibly :versioning the entire block. Should the entire block ever be deleted :by a directory repacking operation (never actually happens on the vast :majority of Linux systems) that will be handled as a file truncate. : :I see that in Hammer, the name is deleted from the btree instead, so :fair enough, that is actual deletion of an object. But it could be :handled alternatively by adding a "this object is not here" object, :which might be enough to get rid of delete_tid entirely and just have :create_tid. Hmm. I think the delete_tid is less complex. I'm not really concerned about saving 8 bytes in the B-Tree element. I don't want to make it bigger then the 64 bytes it already is, but neither am I going to add whole new subsystems just to get rid of 8 bytes. :I plan to scan the whole btree initially, which is what we do in ddsnap :and it works fine. But by looking at the mtime attributes in the inode :I can see in which versions the file data was changed and thus not have :to scan most files. Eventually, building in some accelerator to be able :to skip most inode leaves as well would be a nice thing to do. I will :think about that in the background. Yah, that makes sense. You won't have to scan every last tidbit of information. HAMMER would have had to if I had not added the propagating transaction id feature to it. You are still going to have a scaleability issue but it sounds like you could implement the exact same optimization and completely solve the problem. :> You may want to consider something similar. I think using the :> forward-log to optimize incremental mirroring operations is also fine :> as long as you are willing to take the 'hit' of having to scan (though :... : :There is a slight skew in your perception of the function of forward :logging here, I think. Forward logs transactions are not long-lived :disk objects, they just serve to batch together lots of little updates :into a few full block "physical" updates that can be written to the :disk in seek-optimized order. It still works well for this application. Yes, I understand now that you've expanded on that topic. :> The way it works is that HAMMER's frontend is almost entirely :> disconnected from the backend. All meta-data operations are cached :> in-memory. create, delete, append, truncate, rename, write... :> you name it. Nothing the frontend does modifies any meta-data :> or mata-data buffers. : :Things are a little different in Linux. The VFS takes care of most of :what you describe as your filesystem front end, and in fact, the VFS is :capable of running as a filesystem entirely on its own, just by :supplying a few stub methods to bypass the backing store (see ramfs). :I think this is pretty much what you call your front end, though I :probably missed some major functionality you provide there that the :VFS does not directly provide. Well, for file data, up to a point, and HAMMER does just use the kernel's buffer cache for file data. Not for directory entries. I'm pretty sure Linux uses a throw-away namecache just like we do, so unless it has a mechanic for flush sequencing of dirty namecache entries you need in-VFS structural support for caching the operations. In particular, caching a namespace deletion (rm, rename, etc) without touching the meta-data requires implementing a merged lookup so you can cancel-out the directory entry that still exists on-media. So it isn't entirely trivial. Things related to file data are handled by the kernel for the most part, but truncations still need considerable support within the VFS to properly handle seek-write or truncate file extensions. In order to avoid touching any meta-data when handling a truncation, a similar canceling field in the in-memory inode needs to be maintained until the backend is able to delete the related data blocks. Lookups have to take into account the cached truncation offset so as not to read data from the media past the truncation point (e.g. if you seek-write-extend or truncate-extend the file immediately after truncating the file). Most filesystems will dirty meta-data buffers related to the media storage as part of executing an operation on the frontend. I don't know of any which have the level of separation that HAMMER has. :Note: no other filesystem on Linux current works this way. They all :pretty much rely on the block IO library which implements the fairly :goofy one-block-at-a-time transfer regimen. The idea of setting up bio :transfers directly from the filesystem is unfortunately something new. :We should have been doing this for a long time, because the interface :is quite elegant, and actually, that is just what the block IO library :is doing many levels down below. It is time to strip away some levels :and just do what needs to be done in the way it ought to be done. Jeeze, BSD's have been doing that forever. That's what VOP_BMAP is used for. I'm a little surprised that Linux doesn't do that yet. I'll expand on that down further. :... :> the media. The UNDO *is* flushed to the media, so the flush groups :> can build a lot of UNDO up and flush it as they go if necessary. : :Hmm, and I swear I did not write the above before reading this paragraph. :Many identical ideas there. Yah, I'm editing my top-side responses as I continue to read down. But I'm gleaning a lot of great information from this conversation. :> for fsync() operations... those would go much faster using a forward :> log then the mechanism I use because only the forward-log would have :> to be synchronized (not the meta-data changes) to 'commit' the work. :> I like that, it would be a definite advantage for database operations. : :I think you will like it even more when you realize that updating the :filesystem header is not required for most transactions. A definite advantage for a quick write()/fsync() like a database might do. I might be able to implement a forward-log for certain logical operations. Due to my use of a fixed UNDO FIFO area I can pre-format it and I can put a 64 bit sequence number at the head of each block. There would be no chance of it hitting a conflict. Hmm. Yes, I do think I could get rid of the volume header update without too much effort. :A versioned extent: : : struct extent { unsigned version:10, count:6, block:24; }; : :Directory format to index the extent within a leaf: : : struct entry { unsigned loglo:24, offset:8 }; : struct group { unsigned loghi:24, count:8 }; : :Leaf layout (abridged): : : struct leaf { unsigned groups; struct group[]; struct entry[]; struct el= :ements[] }; :... :See, I am really obsessive when it comes to saving bits. None of these :compression hacks costs a lot of code or has a big chance of hiding a :bug, because the extra boundary conditions are strictly local. Man, those are insanely small structures. Watch out for the cpu and in-kernel-memory tracking overhead. : * The linux page cache not only caches data but provides an efficient : access path to logical blocks that avoids having to consult the : filesystem metadata repeatedly under heavy traffic. Actually, BSD buffer caches are logical caches, and always have been. Linux actually took a page from our book on that one. In fact, if you do a google search I'll bet you will see my name in some of those Linux VM system debates 5-10 years ago :-) The typical BSD (Open, Net, Free, DragonFly, etc) buffer cache structure is a logically indexed entity which can also contain a cached physical translation (which is how the direct data bypass works). :... :> Are you just going to leave them in the log and not push a B-Tree :> for them?=20 : :It did not occur to me to leave them in the logical log, but yes that :is a nice optimization. They will only stay there for a while, until a :rollup is triggered, moving them into the inode table proper. It gets :a little tricky to figure out whether the amortization due to the :logical logging step is worth the extra write. There is a crossover :somewhere, maybe around a hundred bytes or so, ideal for symlinks and :acls. I'm just going to throw this out. In an earlier incarnation of the HAMMER design I had a forward log which embedded data, and the idea was that I would leave the data in-place in the log. I was going to use a blockmap to turn the log into a sparse-map as a means of recovering space from it. I eventually threw away that idea but it occurs to me that you might be able to use a similar idea to compress small bits of meta-data. :> :I might eventually add some explicit cursor caching, but various :> :artists over the years have noticed that it does not make as much :> :difference as you might think. :>=20 :> For uncacheable data sets the cpu overhead is almost irrelevant. : :For rotating media, true. There is also flash, and Ramback... Yes, true enough. Still only a 20% cache hit though, due to the many other in-memory structures involved (namecache, vnode, in-memory inode, etc). :> I think it will be an issue for people trying to port HAMMER. I'm tr= :ying :> to think of ways to deal with it. Anyone doing an initial port can=20 :> just drop all the blocks down to 16K, removing the problem but :> increasing the overhead when working with large files. : :Do the variable sized page hack :-] Noooo! My head aches. :Or alternatively look at the XFS buffer cache shim layer, which :Christoph manhandled into kernel after years of XFS not being accepted :into mainline because of it. The nice advantage of controlling the OS code is I could change DragonFly's kernel to handle mixed block sizes in its buffer cache. But I still want to find a less invasive way in the HAMMER code to make porting to other OS's easier. :Ext2 and Ext3) so going to extents is not optional. I plan a limit of :64 blocks per extent, cutting the metadata overhead down to 16K per :gigabyte, which is hard to complain about. Larger extents in the file :index would not buy much and would mess up the nice, 8 byte :one-size-fits-all versioned extent design. That is reasonable. I don't plan on going past about a megabyte per extent myself, there's just no reason to go much bigger. :> Orphan inodes in HAMMER will always be committed to disk with a :> 0 link count. The pruner deals with them after a crash. Orphan :> inodes can also be commited due to directory entry dependancies. : :That is nice, but since I want to run without a pruner I had better :stick with the logical logging plan. The link count of the inode will :also be committed as zero, or rather, the link attribute will be :entirely missing at that point, which allows for a cross check. It won't work for you if your logical log is meant to be short-lived. Programs can hold unlinked inodes open for hours, even days. Even longer. You would have to rewrite the entry in order to be able to move the caboose. Well, that isn't a bad idea, just more complication. :> I do wish we had something like LVM on BSD systems. You guys are very :> lucky in that regard. LVM is really nice. : :But I thought you BSD guys have GEOM. I think GEOM is really nice, and :I think that Linux LVM has so many problems that I have started work on :a replacement, LVM3. GEOM is fairly simplistic compared to LVM, but I'll keep your comments regarding LVM in mind (never having used it). GEOM's claim to fame is its encryption layer. Maybe the ticket for DragonFly is to simply break storage down into a reasonable number of pieces, like cutting up a drive into 256 pieces, and create a layer to move and glue those pieces together into larger logical entities. :> BTW it took all day to write this! : :It took two days to write this ;-) : :I hope the fanout effect does not mean the next post will take four days. : :Regards, : :Daniel I'm editing down as much as I can :-) This one took 6 hours, time to get lunch! -Matt Matthew Dillon <dillon@backplane.com>
From: Daniel Phillips Subject: Re: [Tux3] Comparison to Hammer fs design Date: Jul 27, 6:48 pm 2008 On Sunday 27 July 2008 14:31, Matthew Dillon wrote: > :When a forward commit block is actually written it contains a sequence > :number and a hash of its transaction in order to know whether the > :... > :Note: I am well aware that a debate will ensue about whether there is > :any such thing as "acceptable risk" in relying on a hash to know if a > :commit has completed. This occurred in the case of Graydon Hoare's > :Monotone version control system and continues to this day, but the fact > :is, the cool modern version control systems such as Git and Mercurial > :now rely very successfully on such hashes. Nonetheless, the debate > :will keep going, possibly as FUD from parties who just plain want to > :use some other filesystem for their own reasons. To quell that > :definitively I need a mount option that avoids all such commit risk, > :perhaps by providing modest sized journal areas salted throughout the > :volume whose sole purpose is to record log commit blocks, which then > :are not forward. Only slightly less efficient than forward logging > :and better than journalling, which has to seek far away to the journal > :and has to provide journal space for the biggest possible journal > :transaction as opposed to the most commit blocks needed for the largest > :possible VFS transaction (probably one). > > Well, you've described both sides of the debate quite well. I for > one am in the 0-risk camp. ReiserFS got into trouble depending on > a collision space (albeit a small one). I did extensive testing of > 64 bit collision spaces when I wrote Diablo (A USENET news system > contemporary with INN, over 10 years ago) and even with a 64 bit space > collisions could occur quite often simply due to the volume of article > traffic. > > I think the cost can be reduced to the point where there's no need > to allow any risk at all. After all, we're only talking about meta-data > updates here for the most part. Even under heavy loads it can take > 30 seconds for enough dirty meta-data to build up to require a flush. > One can flush the mess, then seek/update the volume header. So I do not want users to fixate on that detail. The mount option allows them to choose between "fast but theoretically riskier" and "warm n fuzzy zero risk but not quite so fast". If the example of Ext3 is anything to go by, almost everybody chooses the "ordered data" mode over the "journal data" mode given the tradeoff that the latter is about 30% slower but offers better data integrity for random file rewrites. The tradeoff for the option in Tux3 will be maybe 1 - 10% slower in return for a miniscule reduction in the risk of a false positive on replay. Along the lines of deciding to live underground to avoid the risk of being hit by a meteorite. Anyway, Tux3 will offer the option and everybody will be happy. Actually implementing the option is pretty easy because the behavior variation is well localized. > Or doing something like using a fixed LOG area, allowing it to be > preformatted... that would remove 100% of the risk and not require > a volume header update (drafting note: edited up from further below). > > I know volume headers seem old-school. It was a quick and dirty > solution in HAMMER that I will be changing. > > :Actually, the btree node images are kept fully up to date in the page > :cache which is the only way the high level filesystem code accesses > :them. They do not reflect exactly what is on the disk, but they do > :reflect exactly what would be on disk if all the logs were fully > :rolled up ("replay"). > > Ok, this is simpler. Hmm. That brings up the question with regards > to data allocations and the log. Are you going to use a fixed log > space and allocate data-store separately or is the log a continuously > running pointer across the disk? > > My understanding is it is a continuously running space. Yes, a continuous running space, not preallocated. The forward log will insert itself into any free blocks that happen to be near the the transaction goal location. Such coopted free space will be implicitly unavailable for block allocation, slightly complicating the block allocation code which has to take into consideration both the free space map and all the outstanding log transactions. > :A forward log that carries the edits to some dirty cache block pins > :that dirty block in memory and must be rolled up into a physical log > :before the cache block can be flushed to disk. Fortunately, such a > :rollup requires only a predictable amount of memory: space to load > :enough of the free tree to allocate space for the rollup log, enough > :... > > Ok, here I spent about 30 minutes constructing a followup but then > you answered some of the points later on, so what I am going to do > is roll it up into a followup on one of your later points :-) > > :One traditional nasty case that becomes really nice with logical > :forward logging is truncate of a gigantic file. We just need to > :commit a logical update like ['resize', inum, 0] then the inode data > :truncate can proceed as convenient. Another is orphan inode handling > :where an open file has been completely unlinked, in which case we > :log the logical change ['free', inum] then proceed with the actual > :delete when the file is closed or when the log is replayed after a > :surprise reboot. > :... > :Logical log replay is not idempotent, so special care has to be taken > :on replay to ensure that specified changes have not already been > > Wait, it isn't? I thought it was. I think it has to be because > the related physical B-Tree modifications required can be unbounded, Logical logging is not idempotent by nature because of the uncertainty of the state of the object edited by a logical operation: has the operation already been applied or not? If you know something about the structure of the target you can usually tell. Suppose the operation is a dirent create. It has been applied to the target if the direct already exists, otherwise not. I do not like this style of special case hacking to force the logical edit to be idempotent. Instead, I choose to be sure about the state of the target object by introducing the rule that after a logical log operation has been generated, nothing is allowed to write to the object being edited. The logical operation pins the state of the disk image of target object. The object stays pinned until the logical log entry has been retired by a physical commit to the object that updates the object and retires the logical edit in one atomic log transaction. I was working on a specific code example of this with pseudocode and all disk transfers enumerated, but then your email arrived. Well after this response the example will probably be better. > and because physical B-Tree modifications are occuring in parallel the > related physical operations cause the logical operations to become > bound together, meaning the logical ops *cannot* be independantly > backed out Tux3 never backs anything out, so there is some skew to clear up. The logical and physical log operations are indeed bound together at both the disk image and the processor level, but not in the way that you might expect. The primary purpose of rolling up logical log entries into physical updates is to control resource consumption; the secondary purpose is to reduce the amount of replay required to reconstruct "current" memory images of the btree blocks on reboot or remount. In the first case, resource exhaustion, a high level vfs transaction may have to wait for a rollup to finish, cleaning a number of dirty buffer cache blocks and releasing resources needed for the rollup, such as bio structs and lists of transaction components. Such blocking is nominally enforced by the VMM by sending the requesting process off to scan for and try to free up some dirty memory (a horribly bad design idea in Linux that causes recursive calls into the filesystem, but that is the way it works) during which time the process is effectively blocked. Tux3 will add a more responsible level of resource provisioning by limiting the number of transactions that can be in flight, much as Ext3 does. This allows the filesystem to fullfill a promise like "never will use up all of kernel reserve memory after having been given privileged access to it in order to clean and recover some dirty cache blocks". In the second case, keeping the replay path short, high level operations can proceed in parallel with physical updates, because a transaction is _logically_ completed as soon as committing the associated logical edits to disk has completed. So high level operations block on logical commit completions, not on physical commit completions except in the case of resource exhaustion. > Here's an example: > > rm a/b/c <---- 1TB file > rmdir a/b > rmdir a > > There are three problems here. > > First, the logical log must be idempotent, or at least mostly > idempotent, because the physical portions of the three separate > logical transactions will complete at different times. If the user writes: rm a/b/c& rm a/b& rm a& Then they deserve what they get, which is a probable complaint that a directory does not exist. If the operations happen to execute in a favorable order then they will all succeed. With Tux3 this is possible because destroying the 1TB file can proceed asynchronously while the VFS transaction completes and returns as soon as a commit containing ['unlink', inum_a/b/c, dnum_a/b] ['destroy', inum_a/b/c] has been logged. So deleting the 1TB file can be very fast, although it may be some time before all the occupied disk space becomes available for reuse. To make the deletion visible to high level filesystem operations, some cached disk blocks have to be updated. For a huge deletion like this it makes sense to invent a versioned "orphan" attribute, which is added to the inode table leaf, meaning that any data lookups arriving from still-open files should ignore the file index tree entirely. > Second, the logical log entry for "rm a/b/c" cannot be destroyed > (due to log cycling based on available free space) until after the > related physical operation has completed, which could occur seconds > to minutes later. True, but that only pins one commit block on disk. > The second issue could become a big issue for you because it means > the log space required will become unbounded or nearly unbounded. > If the remove requires 10's to 100's of megabytes of physical log > entries, all appending to the log, and the physical blocks backing > the log cannot be recovered due to that 'rm a/b/c' needing to remain > in the log until the physical operation is complete, then I think you > have a problem. You would probably need to special-case it. I do not think I have that problem because recovering the space used by the big file can proceed incrementally: free a few blocks from the inode index leaves; commit the revised file index blocks and free tree blocks; repeat as necessary. After a crash, find the logical inode destroy log entry and resume the destroy. > A third issue effecting the amount of free space required to complete > an operation is going to be the worst case physical log space required > for a single low-level B-Tree transaction, which will roughly be > related to the depth of the B-Tree, multiplied by the number of modifying > B-Tree operations being done in parallel (e.g. from many individual > processes). Yes indeed. A matter of totalling all that up, which is fortunately bounded to a nice low number. Then in grand Linux tradition, ignore the totals and just rely on there being "megabytes" of reserve memory to handle the worst case. Obviously a bad idea, but that is how we have done things for many years. > :... > :ugly or unreliable when there are lot of stacked changes. Instead I > :introduce the rule that a logical change can only be applied to a known > :good version of the target object, which promise is fullfilled via the > :physical logging layer. > :... > :then apply the logical changes to it there. Where interdependencies > :exist between updates, for example the free tree should be updated to > :reflect a block freed by merging two btree nodes, the entire collection > :of logical and physical changes has to be replayed in topologically > :sorted order, the details of which I have not thought much about other > :than to notice it is always possible. > > I came up with a really simple solution to the freemap-reuse issue > which you could apply to allow such replays to be idempotent. You > simply do not allow a freed block to be reallocated until after all > the information related to the logical operation that freed it has > been solidly committed. Thus you can replay the low level frees > without accidently re-freeing a block that has been reallocated. > > In HAMMER I delay reuse of freed blocks because UNDO's are only for > meta-data, not file data, and thus freed data blocks cannot be reused > until after the related freeing operations has been completely > committed so crash recovery can do the rollback. You could use them > to make otherwise non-idempotent operations idempotent. It sounds like a good idea, I will ponder. > :When replay is completed, we have a number of dirty cache blocks which > :are identical to the unflushed cache blocks at the time of a crash, > :and we have not yet flushed any of those to disk. (I suppose this gets > :interesting and probably does need some paranoid flushing logic in > :replay to handle the bizarre case where a user replays on a smaller > :memory configuration than they crashed on.) The thing is, replay > :returns the filesystem to the logical state it was in when the crash > :happened. This is a detail that journalling filesystem authors tend > :to overlook: actually flushing out the result of the replay is > :pointless and only obscures the essential logic. Think about a crash > :during replay, what good has the flush done? > > Well, you have to flush at some point anyway because you only have > so much kernel memory to play with.... You do not have enough kernel > memory to cache the dirty buffers for all the physical activity related > to a particular logical operation (such as truncating a 1TB file). I hope that question is cleared up now. > :I do not see why this example cannot be logically logged in pieces: > : > : ['new', inum_a, mode etc] ['link', inum_parent, inum_a, "a"] > : ['new', inum_b, mode etc] ['link' inum_a, inum_b, "b"] > : ['new', inum_c, mode etc] ['link' inum_a_b, inum_c, "c"] > : ['new', inum_x, mode etc] ['link' inum_a_b, inum_x, "x"] > : ['new', inum_d, mode etc] ['link' inum_a_b_c, inum_d, "d"] > : > :Logical updates on one line are in the same logical commit. Logical > :allocations of blocks to record the possibly split btree leaves and new > :allocations omitted for clarity. The omitted logical updates are > :bounded by the depth of the btrees. To keep things simple, the logical > :log format should be such that it is impossible to overflow one commit > :block with the updates required to represent a single vfs level > :transaction. > > Hmm. Maybe. This helps you identify dependancies that must be > replayed, but it might be easier just to make the entire logical > log idempotent and just replay everything. I think the only gotcha > there is handling link counts. That would still be far less complex > then the alternative. The presence of a "link" record in the logical log implies a link count in addition to whatever is recorded in the on-disk inode table leaf. On the other hand, the "current" image of the inode table leaf in the buffer cache has the correct link count when the sys_link returns. > :transactions are not just throwaway things, they are the actual new > :data. Only the commit block is discarded, which I suppose will leave > :a lot of one block holes around the volume, but then I do not have to > :require that the commit block be immediately adjacent to the body of > :the transaction, which will allow me to get good value out of such > :holes. On modern rotating media, strictly linear transfers are not > :that much more efficient than several discontiguous transfers that all > :land relatively close to each other. > > True enough. Those holes will create significant fragmentation > once you cycle through available space on the media, though. It depends on how high the density of such holes is. I will try to keep it down to a few per megabyte. The nice think about having some one block holes strewn around the disk is, there is always a nearby place for a commit block to camp out for a while. > :> So your crash recovery code will have to handle=20 > :> both meta-data undo and completed and partially completed transaction= > : > :Tux3 does not use undo logging, only redo, so a transaction is complete > :as soon as there is enough information on durable media to replay the > :redo records. > > Ok, that works. Be careful w/ your physical logging, it can get > expensive. I think it may be more expensive then you are currently > estimating. I am hoping not to mess that up ;-) The thing about physical logging in Tux3 is, the logged data blocks are normally immediately linked into the index blocks as actual file data or btree metadata blocks, so they are only written once, and they are hopefully written near where the rest of the structure lives. It is only in the optional "update in place" style of physical logging that the commit data blocks are written twice, which will be no worse than a traditional journal and probably a lot better because of not needing to seek to the journal region. > One of the reasons why HAMMER uses an UNDO log is that it is a lot > less expensive then a REDO log. I'll show you why: > > Physical operations related to a single flush group in HAMMER: > > modify element in B-Tree node A (create UNDO w/ original contents of > B-Tree node A) > modify element in B-Tree node A (no new UNDO needs to be created) > modify element in B-Tree node A (no new UNDO needs to be created) > modify element in B-Tree node A (no new UNDO needs to be created) > modify element in B-Tree node A (no new UNDO needs to be created) Nice, now I understand. But do you not have to hold all filesystem transactions that depend on the modification until the btree node has completed writing to disk? With logical logging you only have to wait on the logical commit to complete, into which may be packed a number of other changes for efficiency. > (NOTE: I do realize that the REDO log can be compressed just as the > UNDO one, by recording actual data shifts from B-Tree insertions > and deletions, it gets really complex when you do that, though, and > would not be idempotent). As described above, it is not how I do it. > See the advantage? Many B-Tree operations working on the same node, > or revisiting that node later on, within the same flush group, do not > have to lay down more then one UNDO record for that B-Tree node. > > The physical ops are modifying already modified space so I don't > have to record each op in the UNDO log. Upon crash recovery the whole > mess is undone by that first UNDO record. > > I just run the UNDOs backwards, which allows me to cache the in-flush > undo areas with a fixed set of small throw-away tracking structures > in system memory. If a structure gets thrown away it just means an > extranious UNDO record will be appended to the UNDO FIFO. I think I see it, but I have my doubts because you have to block transactions waiting for the up to date copy of the btree data to land on disk. Either that, or you may give userspace the impression that some transaction has gotten onto stable storage when that is not the case. If you do in fact block transactions until the current image of the btree leaf has been written out then the blocking behavior is no better than REDO style, and REDO can batch together many updates to different logical blocks into a single commit block, which seems like fewer transfers overall. > :Yes. Each new commit records the sequence number of the oldest commit > :that should be replayed. So the train lays down track in front of > :itself and pulls it up again when the caboose passes. For now, there > :is just one linear sequence of commits, though I could see elaborating > :that to support efficient clustering. > > Yah. I see. Noting that issue I brought up earlier about the > "rm a/b/c". Locking the caboose of your log until all related physical > operations have been completed could create a problem. I hope that is cleared up now. > :One messy detail: each forward log transaction is written into free > :space wherever physically convenient, but we need to be sure that that > :free space is not allocated for data until log rollup has proceeded > :past that transaction. One way to do this is to make a special check > :against the list of log transactions in flight at the point where > :extent allocation thinks it has discovered a suitable free block, which > :is the way ddsnap currently implements the idea. I am not sure whether > :I am going to stick with that method for Tux3 or just update the disk > :image of the free tree to include the log transaction blocks and > :somehow avoid logging those particular free tree changes to disk. Hmm, > :a choice of two ugly but workable methods, but thankfully neither > :affects the disk image. > > You might be able to adapt the delayed-reuse concept I used in HAMMER > to simplify the problem, possibly to the point of making those > operations replayable. > > Another thing I did in HAMMER was reserve space on the frontend for > data blocks without modifying the freemap at all. This allows the > frontend to write file data blocks to the physical media without > having to wait for synchronization with meta-data or the backend or > anything. > > I keep track of the reserved space with small in-memory structures > (whos function is to prevent other allocations from using the reserved > space until it can be committed). Freemap updates are then committed > in the same flush that commits the related B-Tree insertions and > deletions. Thus the freemap is kept in sync. I am gravitating towards that style for Tux3's commit-related short term allocations. > :Ext2 dirents are 8 bytes + name + round up to 4 bytes, very tough to > :beat that compactness. We have learned through bitter experience that > :anything other than an Ext2/UFS style physically stable block of > :dirents makes it difficult to support NFS telldir cookies accurately > :because NFS vs gives us only a 31 bit cookie to work with, and that is > :not enough to store a cursor for, say, a hash order directory > :traversal. This is the main reason that I have decided to go back to > :basics for the Tux3 directory format, PHTree, and make it physically > :stable. > > The cookies are 64 bits in DragonFly. I'm not sure why Linux would > still be using 32 bit cookies, file offsets are 64 bits so you > should be able to use 64 bit cookies. It is not Linux that perpetrates this outrage, it is NVFS v2. We can't just tell everybody that their NFS v2 clients are now broken. > For NFS in DragonFly I use a 64 bit cookie where 32 bits is a hash key > and 32 bits is an iterator to deal with hash collisions. Poof, > problem solved. Which was my original proposal to solve the problem. Then Ted told me about NFS v2 :-O Actually, NFS hands you a 62 bit cookie with the high bits of both s32 parts unused. NFS v2 gives you a 31 bit cookie. Bleah. > :In the PHTree directory format lookups are also trivial: the directory > :btree is keyed by a hash of the name, then each dirent block (typically > :one) that has a name with that hash is searched linearly. Dirent block > :pointer/hash pairs are at the btree leaves. A one million entry > :directory has about 5,000 dirent blocks referenced by about 1000 btree > :leaf blocks, in turn referenced by three btree index blocks (branching > :factor of 511 and 75% fullness). These blocks all tend to end up in > :the page cache for the directory file, so searching seldom references > :the file index btree. > > That seems reasonable. I'll look up the PHTree stuff, it sounds > interesting. Personally speaking I don't mind eating 64 bytes per > directory entry. I can accomodate file names up to 16 bytes in > the B-Tree element itself so 95% of the namespace doesn't even > require a look-aside data reference. > > The one mistake I made, which I didn't have time to fix in the > first HAMMER release, is that my directory hash is a standard crc > which means that records in the B-Tree aren't even close to being > in any sort of lexical order. I will probably wind up changing > the hash calculation a bit to get some lexical sorting information > stuffed into it... should be doable since the hash keys are 64 bits. Yes, I noticed that. Check out dx_hack_hash: http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/fs/ext3/hash.c#L38 It distributed hashes of ascii strings very well for me, with few funnels. It way outperformed some popular hashes like TEA. Ted's cut down crytographic hash is yet more uniform but costs much more CPU. > :will pass Hammer in lookup speed for some size of directory because of > :the higher btree fanout. PHTree starts from way behind courtesy of the > :32 cache lines that have to be hit on average for the linear search, > :amounting to more than half the CPU cost of performing a lookup in a > :million element directory, so the crossover point is somewhere up in > :the millions of entries. > > Well, the B-Tree fanout isn't actually that big a deal. Remember > HAMMER reblocks B-Tree nodes along with everything else. B-Tree > nodes in a traversal are going to wind up in the same 16K filesystem > block. It affects the number of probes you have to do for the lookup. Binsearching hits one cache line per test while each node lookup hits a lot more. > That leaves just element space overhead, which frankly is not much > of an issue on a production system because kernels have to cache a lot > more then just the directory entry. The in-memory vnode and inode > structures take up a lot more space then the cached directory entry. > So even though HAMMER's directory entry is a 64-byte B-Tree element > and roughly double the overhead of a UFS-style directory entry, the > overall cache footprint in the system is only ~20% worse. It is not just the cache footprint but the time it takes to get your key tables into L1 cache. To be sure, we are talking about small details here, but these small details can add up to a difference of a factor of two in execution speed. Which maybe you don't care much about now that you are orders of magnitude faster than what came before ;-) > :... > :> that the B-Tree can seek to directly. In fact, at one point I tried > :> indexing on delete_tid instead of create_tid, but created one hell of= > : a > :> mess in that I had to physically *move* B-Tree elements being deleted > :> instead of simply marking them as deleted. > : > :This is where I do not quite follow you. File contents are never > :really deleted in subsequent versions, they are just replaced. To > :truncate a file, just add or update size attribute for the target > :version and recover any data blocks past the truncate point that > :belongs only to the target version. > > Truncation is a deletion. That is, the B-Tree elements have to > be marked as having been deleted so if you truncate-extend a file > you see the new space as being full of zeros. > > Take a 1MB file. Truncate it to 512K, then truncate-extend the > file back to 1MB. Or seek-write to extend the file... > same difference, really. Following my prescription above, the file will be full of zeros when re-extended because the data pointers were removed. But I think you are right, truncate really is a delete and I actually argued myself into understanding that. > :I plan to scan the whole btree initially, which is what we do in ddsnap > :and it works fine. But by looking at the mtime attributes in the inode > :I can see in which versions the file data was changed and thus not have > :to scan most files. Eventually, building in some accelerator to be able > :to skip most inode leaves as well would be a nice thing to do. I will > :think about that in the background. > > Yah, that makes sense. You won't have to scan every last tidbit of > information. HAMMER would have had to if I had not added the > propagating transaction id feature to it. > > You are still going to have a scaleability issue but it sounds like > you could implement the exact same optimization and completely solve > the problem. Yes, I have thought about it a little more, and I imagine something like a small array of dirty bits that climb up the btree with the help of logical logging, where each bit means that there is something interesting for some corresponding replication target to look at, somewhere down in the subtree. > :> The way it works is that HAMMER's frontend is almost entirely > :> disconnected from the backend. All meta-data operations are cached > :> in-memory. create, delete, append, truncate, rename, write... > :> you name it. Nothing the frontend does modifies any meta-data > :> or mata-data buffers. > : > :Things are a little different in Linux. The VFS takes care of most of > :what you describe as your filesystem front end, and in fact, the VFS is > :capable of running as a filesystem entirely on its own, just by > :supplying a few stub methods to bypass the backing store (see ramfs). > :I think this is pretty much what you call your front end, though I > :probably missed some major functionality you provide there that the > :VFS does not directly provide. > > Well, for file data, up to a point, and HAMMER does just use the > kernel's buffer cache for file data. > > Not for directory entries. I'm pretty sure Linux uses a throw-away > namecache just like we do, so unless it has a mechanic for flush > sequencing of dirty namecache entries you need in-VFS structural > support for caching the operations. The Linux dentry cache actually implements proper namespace semantics all by itself without needing to flush anything, which is what Ramfs is. Ramfs just lives entirely in cache without even being backed by a ramdisk, until the power goes off. > In particular, caching a namespace deletion (rm, rename, etc) > without touching the meta-data requires implementing a merged lookup > so you can cancel-out the directory entry that still exists on-media. > So it isn't entirely trivial. Linux implements "negative dentries" to say "this entry is not here". > Things related to file data are handled by the kernel for the most > part, but truncations still need considerable support within the > VFS to properly handle seek-write or truncate file extensions. > In order to avoid touching any meta-data when handling a truncation, > a similar canceling field in the in-memory inode needs to be > maintained until the backend is able to delete the related data > blocks. Lookups have to take into account the cached truncation offset > so as not to read data from the media past the truncation point (e.g. > if you seek-write-extend or truncate-extend the file immediately > after truncating the file). I have not looked at that part for a while now, and I did not look at it just now, but Al Viro has been fiddling with it for years getting the bugs out one at a time. The last major one I heard of was some time ago. It works somehow, I should look more closely at it. > Most filesystems will dirty meta-data buffers related to the media > storage as part of executing an operation on the frontend. I don't > know of any which have the level of separation that HAMMER has. Modern Linux filesystems get close I think. Particularly in journalled data mode, Ext3 marks all the buffers it deals with as "don't touch" to the VFS and VMM, which have no idea how to obey the necessary ordering constraints. There is also this thing called the "journalling block device" that provides an abstract implementation of a physical journal, which is actually used by more than one filesystem. (Three I think, including Ext4, however I now hear noise about rewriting it.) > :Note: no other filesystem on Linux current works this way. They all > :pretty much rely on the block IO library which implements the fairly > :goofy one-block-at-a-time transfer regimen. The idea of setting up bio > :transfers directly from the filesystem is unfortunately something new. > :We should have been doing this for a long time, because the interface > :is quite elegant, and actually, that is just what the block IO library > :is doing many levels down below. It is time to strip away some levels > :and just do what needs to be done in the way it ought to be done. > > Jeeze, BSD's have been doing that forever. That's what VOP_BMAP is > used for. I'm a little surprised that Linux doesn't do that yet. > I'll expand on that down further. Either you forgot to expand or I missed it. I am interested. > :Directory format to index the extent within a leaf: > : > : struct entry { unsigned loglo:24, offset:8 }; > : struct group { unsigned loghi:24, count:8 }; > ... > Man, those are insanely small structures. Watch out for the cpu > and in-kernel-memory tracking overhead. I have written the code, mostly, and it is tight. A similar idea has been part of ddsnap for 5 years now. > : * The linux page cache not only caches data but provides an efficient > : access path to logical blocks that avoids having to consult the > : filesystem metadata repeatedly under heavy traffic. > > Actually, BSD buffer caches are logical caches, and always have been. > Linux actually took a page from our book on that one. In fact, if > you do a google search I'll bet you will see my name in some of those > Linux VM system debates 5-10 years ago :-) I remember it well. You were the one who put Rik up the reverse map design that caused the main fireworks between 2.3 and 2.4.9. (I was the one who finally got it work by social-engineering the merging of the reverse map part with Andrea's more robust LRU design.) I also remember the time when BSD buffers were far superior to Linux ones, yes. In 2.0 the arrangement sucked enormously: sort-of coherency between the buffer cache and the page cache was achieved by physically copying changes from one to the other. Today, the buffer cache and page cache are nearly fully unified. But the unification will never be complete until the page size is variable, so the remaining tasks done by buffers can be done by pages instead. Anyway, this part of your OS is more similar than different, which will help a lot with a port. > The typical BSD (Open, Net, Free, DragonFly, etc) buffer cache structure > is a logically indexed entity which can also contain a cached physical > translation (which is how the direct data bypass works). Linux caches physical translations by gluing one or more buffer heads onto each page, which is one of the few remaining uses of buffer heads. To finally get rid of buffer heads the physical cache needs to be done some other way. I am sure there are lots of better ways. > I'm just going to throw this out. In an earlier incarnation of the > HAMMER design I had a forward log which embedded data, and the idea > was that I would leave the data in-place in the log. I was going to > use a blockmap to turn the log into a sparse-map as a means of > recovering space from it. > > I eventually threw away that idea but it occurs to me that you might > be able to use a similar idea to compress small bits of meta-data. This is similar to what I do when I write out a physical block and use a logical log entry to make the on-disk parent point at it (actually, it is a promise to make the parent point at it some time in the future. But then I do not let those things live very long. Eventually it might make sense to let them live a little longer, perhaps by generalizing the idea of the single linear log to a forest of little logs, some of which could be left around for a long time instead of being quickly rolled up into the canonical structures. > :> Orphan inodes in HAMMER will always be committed to disk with a > :> 0 link count. The pruner deals with them after a crash. Orphan > :> inodes can also be commited due to directory entry dependancies. > : > :That is nice, but since I want to run without a pruner I had better > :stick with the logical logging plan. The link count of the inode will > :also be committed as zero, or rather, the link attribute will be > :entirely missing at that point, which allows for a cross check. > > It won't work for you if your logical log is meant to be short-lived. > Programs can hold unlinked inodes open for hours, even days. Even > longer. You would have to rewrite the entry in order to be able to > move the caboose. Well, that isn't a bad idea, just more complication. Indeed, I need to handle that. Since I do want to stick with a simple linear sequence of logical log commits for now, I do not want to leave any of them sitting around for a log time. One easy thing to do is to put any orphans aside at rollup time, in an unversioned orphan file, say. > Maybe the ticket for DragonFly is to simply break storage down into > a reasonable number of pieces, like cutting up a drive into 256 pieces, > and create a layer to move and glue those pieces together into larger > logical entities. Linux LVM is all about device mapper, and actually, device mapper does not really have a reason to be a separate subsystem, it can just be the way the block layer works. If interested, check out my posts on bio stacking and rolling device mapper's "clone_and_map" idea up into the generic block layer. I think all unixen should do this. Other than that, device mapper is just about making device numbers point at different virtual devices, transparently to userspace. Even device stacking works like that: extract the "virtual" device (which may or may not be a real device) from the device number you are going to remap, store it in some other device number, create a new virtual device on top of the other device number, finally store the new virtual device in the old device number. The old device number has now been transparently stacked. Nothing to it. > I'm editing down as much as I can :-) This one took 6 hours, time to > get lunch! Only 3 hours so far here... hmm, make that four. Daniel
From: Matthew Dillon Subject: Re: [Tux3] Comparison to Hammer fs design Date: Jul 27, 8:46 pm 2008 :Tux3 never backs anything out, so there is some skew to clear up. Got it. I'm still not clear on the physical vs logical log interactions required for crash recovery, but I think I understand everything else. :> One of the reasons why HAMMER uses an UNDO log is that it is a lot :> less expensive then a REDO log. I'll show you why: :> :> Physical operations related to a single flush group in HAMMER: :> :> modify element in B-Tree node A (create UNDO w/ original contents of :> B-Tree node A) :> modify element in B-Tree node A (no new UNDO needs to be created) :> modify element in B-Tree node A (no new UNDO needs to be created) :> modify element in B-Tree node A (no new UNDO needs to be created) :> modify element in B-Tree node A (no new UNDO needs to be created) : :Nice, now I understand. But do you not have to hold all filesystem :transactions that depend on the modification until the btree node has :completed writing to disk? With logical logging you only have to wait :on the logical commit to complete, into which may be packed a number of :other changes for efficiency. The only major interlock between frontend searches and backend flushes occurs on an inode-by-inode basis and only for the duration of the modifying B-Tree operation, which does *not* include the writing of the modified meta-data buffer to disk. Once the operation concludes we are left with dirty meta-data buffers which for all intents and purposes have transfered the in-memory cached B-Tree records (held as separate entities in a red-black tree in-memory) into the media view of the B-Tree (having performed the actual B-Tree operations necessary to insert or delete the record), and the inode lock is then released. The flusher flushes each inode on the flush list in turn, with certain long-lived inodes (e.g. truncation of a 1TB file to 0) only getting partial goodness and then being moved to the next flush list. The inode lock is held for each individual inode as it is being flushed, one inode at a time. The actual dirty meta-data buffers are flushed later on, without any inode lock held and without having any blocking effect on the frontend. This media-write sequence occurs at the end, after all inodes have been processed. UNDO finishes writing, we wait for I/O to complete, we update the volume header, we wait for I/O to complete, then we asynchronously flush the dirty media buffers (and don't wait for that I/O to complete so the writes can be merged with the beginning of the next flush group). That is what I meant when I said the frontend was disconnected from the backend. The worst the frontend ever does is issue reads on media meta-data as part of any merged memory-media B-Tree searches it must perform to locate information. Any modifying operations made by the frontend will create records in the in-memory red-black tree, but will *NOT* do any modifying oprations on meta-data buffers representing the on-media B-Tree. Frontend lookups are resolved by doing a merged-search with two cursors (one on the in-memory red-black tree, and one on the on-media B-Tree). I considered doing the interlock at the record level but hit some hicups in that one of the cursors may have already passed the memory record being committed but the other cursor (the media view) may not have yet passed the B-Tree index where that record would be written to. So if a commit were to occur right then, without a governing inode lock, the merged search might see a duplicate record (or no record in either view in the case of a deletion). :> I just run the UNDOs backwards, which allows me to cache the in-flush :> undo areas with a fixed set of small throw-away tracking structures :> in system memory. If a structure gets thrown away it just means an :> extranious UNDO record will be appended to the UNDO FIFO. : :I think I see it, but I have my doubts because you have to block :transactions waiting for the up to date copy of the btree data to land :on disk. Either that, or you may give userspace the impression that :some transaction has gotten onto stable storage when that is not the :case. Userland doesn't care whether the actual transactions have made it to disk or not unless it actually calls fsync(). If it does then it will block through the necessary flush cycles. fsync() on HAMMER is pretty slow as it will likely be syncing a lot more in the flush group then just the requested inode. If userland does not call fsync() (or 'sync', etc) then it doesn't block at all, except as described above during the modify-buffer phase of the B-Tree updates (on an inode-by-inode basis so actual blocking is virtually non-existant). It is often possible to copy an entire directory tree without initiating a backend flush. Media B-Tree elements, inclusive of any held in dirty meta-data buffers, are locked on a B-Tree node by B-Tree node basis. These are very short lived locks and have nothing to do with the actual flushing of the dirty meta-data buffers to the physical media. The frontend always gets shared locks, the backend always gets exclusive locks. Some blocking between the frontend and backend occurs but not a whole lot. The backend flusher actually blocks against itself more often (it runs multiple threads), due to the locality of reference within the B-Tree for nearby inodes. :> The cookies are 64 bits in DragonFly. I'm not sure why Linux would :> still be using 32 bit cookies, file offsets are 64 bits so you :> should be able to use 64 bit cookies. : :It is not Linux that perpetrates this outrage, it is NVFS v2. We can't :just tell everybody that their NFS v2 clients are now broken. Oh, we don't care about NFSv2 all that much any more. NFSv3 is the bare minimum. NFSv2 is extremely old, nobody should be using it any more. Even NFSv3 is getting fairly long in the tooth now. :> For NFS in DragonFly I use a 64 bit cookie where 32 bits is a hash key :> and 32 bits is an iterator to deal with hash collisions. Poof, :> problem solved. : :Which was my original proposal to solve the problem. Then Ted told me :about NFS v2 :-O : :Actually, NFS hands you a 62 bit cookie with the high bits of both s32 :parts unused. NFS v2 gives you a 31 bit cookie. Bleah. I'd recommend dropping support for NFSv2. It is not really worth supporting any more. Does it even support 64 bit inodes? (I don't remember), or 64 bit file offsets? NFSv2 is garbage. You should be able to use 63 bits of the cookie, I don't know why you wouldn't use the high bit of the lsb 32 bit part. There is no requirement that that bit be 0. In fact, the RFC says the cookie is a 64 bit unsigned integer and you should be able to use all 64 bits. If linux is not allowing all 64 bits to be used then it's a serious bug in linux. The cookies are supposed to be opaque, just like the file handle. :> which means that records in the B-Tree aren't even close to being :> in any sort of lexical order. I will probably wind up changing :> the hash calculation a bit to get some lexical sorting information :> stuffed into it... should be doable since the hash keys are 64 bits. : :Yes, I noticed that. Check out dx_hack_hash: : : http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/fs/ext3/hash.c#L38 :.. :It distributed hashes of ascii strings very well for me, with few :funnels. It way outperformed some popular hashes like TEA. Ted's cut :down crytographic hash is yet more uniform but costs much more CPU. I like :-). I'll probably end up with multiple directory hash formats too before this is over. Fortunately I have room in the inode to specify the format the directory will use. :... :> Well, the B-Tree fanout isn't actually that big a deal. Remember :> HAMMER reblocks B-Tree nodes along with everything else. B-Tree :> nodes in a traversal are going to wind up in the same 16K filesystem :> block. : :It affects the number of probes you have to do for the lookup. :Binsearching hits one cache line per test while each node lookup hits a :lot more. :... :It is not just the cache footprint but the time it takes to get your :key tables into L1 cache. To be sure, we are talking about small :details here, but these small details can add up to a difference of a :factor of two in execution speed. Which maybe you don't care much about :now that you are orders of magnitude faster than what came before ;-) I'm more worried about extra disk seeks, bloated UNDO records, and blockages due to B-Tree locking. The fully cached stuff already performs extremely well and I don't feel that I have to eek out much more performance out of it. And, frankly, the L1/L2 cache characteristics are a wash and I'm already burning a lot of inefficient data accesses by doing the power-of-2 search on the elements in the first place. Finding a better way to home in on the B-Tree element, aka getting that power-of-2 search out of the critical path... that would be worth doing. Worrying about an occassional node traversal is way below the noise floor. :Yes, I have thought about it a little more, and I imagine something :like a small array of dirty bits that climb up the btree with the help :of logical logging, where each bit means that there is something :interesting for some corresponding replication target to look at, :somewhere down in the subtree. You can do with one transaction id / mtime field in the B-Tree node header. Strictly speaking you don't need a per-element field on internal elements, though I have one there too because I had the space. :> Not for directory entries. I'm pretty sure Linux uses a throw-away :> namecache just like we do, so unless it has a mechanic for flush :> sequencing of dirty namecache entries you need in-VFS structural :> support for caching the operations. : :The Linux dentry cache actually implements proper namespace semantics :all by itself without needing to flush anything, which is what Ramfs :is. Ramfs just lives entirely in cache without even being backed by :a ramdisk, until the power goes off. That's cool. You will want to flush it since you are backing the system caches with a real filesystem, and you would probably need a tracking structure of some sort anyway to deal with the directory entry create/delete dependancies. :> In particular, caching a namespace deletion (rm, rename, etc) :> without touching the meta-data requires implementing a merged lookup :> so you can cancel-out the directory entry that still exists on-media. :> So it isn't entirely trivial. : :Linux implements "negative dentries" to say "this entry is not here". Yes, BSD's have those too (and always have), but like the positive entries they are a throw-away cache. :> blocks. Lookups have to take into account the cached truncation offset :> so as not to read data from the media past the truncation point (e.g. :> if you seek-write-extend or truncate-extend the file immediately :> after truncating the file). : :I have not looked at that part for a while now, and I did not look at :it just now, but Al Viro has been fiddling with it for years getting the :bugs out one at a time. The last major one I heard of was some time :ago. It works somehow, I should look more closely at it. I can believe that. It took me a while to get cached truncation offsets working in HAMMER. It's probably the most complex bit of code after the merged memory/media-B-Tree iterator. :> Jeeze, BSD's have been doing that forever. That's what VOP_BMAP is :> used for. I'm a little surprised that Linux doesn't do that yet. :> I'll expand on that down further. : :Either you forgot to expand or I missed it. I am interested. Oops, I think I did forget to expand that. Basically buffer cache buffers under kernel management are logically indexed. Each vnode has its own tree of buffer cache buffers. When the kernel wants to flush dirty buffers or cluster adjacent buffers together for I/O it uses VOP_BMAP to map the logical offsets to physical offsets (so kernel-supported clustering is thus based on physical offsets). The VFS is responsible for implementing the BMAP operation. It is the BMAP operation which assigns actual physical media blocks to a high level logically mapped filesystem buffer. The mapping is persistent as long as the buffer cache buffer exists. Of course clean buffers can be thrown away by the kernel so at some point in the future the kernel might re-request the VOP_BMAP for that logical offset. The VFS can choose when to implement the physical mapping. It can tell the kernel to take a hike and do it when the kernel actually writes the buffer out (being a logical buffer the flush goes through the VFS via VOP_STRATEGY), it can do it early on when the actual read() or write() opration occurs, etc. :I remember it well. You were the one who put Rik up the reverse map :design that caused the main fireworks between 2.3 and 2.4.9. (I was :the one who finally got it work by social-engineering the merging of :the reverse map part with Andrea's more robust LRU design.) : :I also remember the time when BSD buffers were far superior to Linux :ones, yes. In 2.0 the arrangement sucked enormously: sort-of coherency :between the buffer cache and the page cache was achieved by physically :copying changes from one to the other. : :Today, the buffer cache and page cache are nearly fully unified. But :the unification will never be complete until the page size is variable, :so the remaining tasks done by buffers can be done by pages instead. : :Anyway, this part of your OS is more similar than different, which will :help a lot with a port. Hey, I like the term fireworks. The VM system is the most complex piece of the kernel, in both Linux and BSD. I'm not surprised. BSDs have always had a unified buffer / VM cache, but until recently VFS's have had to have knowledge of the VM system to implement VOP_PUTPAGES and VOP_GETPAGES. One of the things I did in DragonFly was finally finish that unification up so VFSs now only need to implement VOP_READ and VOP_WRITE, and do not have to implement VOP_PUTPAGES and VOP_GETPAGES any more. BSD buffer cache buffers are backed by vm_page_array[], an array of VM pages, and the base of the buffer does not have to be page-aligned. So variable page sizes are not necessary. FreeBSD is about 80% of the way there. They haven't taken the final steps yet. I don't know about NetBSD and OpenBSD. :Indeed, I need to handle that. Since I do want to stick with a simple :linear sequence of logical log commits for now, I do not want to leave :any of them sitting around for a log time. One easy thing to do is to :put any orphans aside at rollup time, in an unversioned orphan file, say. I had a little time to think about it. I think the easiest solution is, if you are forced to flush an open inode with 0 links, to actually record the open state in a separate B-Tree so upon remounting after a crash you can simply scan that B-Tree and finish off the inodes you find in that tree. That way it would not have to be held in the log at all as anything other then a transitory entry. :Linux LVM is all about device mapper, and actually, device mapper does :not really have a reason to be a separate subsystem, it can just be the :way the block layer works. If interested, check out my posts on bio :stacking and rolling device mapper's "clone_and_map" idea up into the :generic block layer. I think all unixen should do this. I'll look it up! :Daniel : -Matt Matthew Dillon <dillon@backplane.com>
From: Daniel Phillips Subject: Re: [Tux3] Comparison to Hammer fs design Date: Aug 5, 1:21 am 2008 On Sunday 27 July 2008 20:46, Matthew Dillon wrote: > :Tux3 never backs anything out, so there is some skew to clear up. > > Got it. I'm still not clear on the physical vs logical log > interactions required for crash recovery, but I think I understand > everything else. The reason for my unclear crash recovery algorithm is, it was not fully designed when we started chatting, only the basic principles of robust micro-transaction commit. You forced me to go work out the details of the high level transaction architecture. The big points you got me focussed on were, what about mass commits a la fsync, and how do you sort out the ties that arise between otherwise independent high level operations, and how do you preserve the ordering of low level operations to reflect the ordering constraints of the high level operations? Which is what you were trying to say with your create a, create a/b example, I now realize. > :> One of the reasons why HAMMER uses an UNDO log is that it is a lot > :> less expensive then a REDO log. I'll show you why: > :> > :> Physical operations related to a single flush group in HAMMER: > :> > :> modify element in B-Tree node A (create UNDO w/ original contents of > :> B-Tree node A) > :> modify element in B-Tree node A (no new UNDO needs to be created) > :> modify element in B-Tree node A (no new UNDO needs to be created) > :> modify element in B-Tree node A (no new UNDO needs to be created) > :> modify element in B-Tree node A (no new UNDO needs to be created) > : > :Nice, now I understand. But do you not have to hold all filesystem > :transactions that depend on the modification until the btree node has > :completed writing to disk? With logical logging you only have to wait > :on the logical commit to complete, into which may be packed a number of > :other changes for efficiency. > > The only major interlock between frontend searches and backend flushes > occurs on an inode-by-inode basis and only for the duration of the > modifying B-Tree operation, which does *not* include the writing of > the modified meta-data buffer to disk. I am well down path to identifying a similar frontend/backend separation in Tux3. The back end data consists of all the dirty metadata and file data cache blocks (buffer cache and page cache(s) respectively) and the "phase" boundaries, which separate the results of successive large groups of high level operations so that there are no disk-image dependencies between them. I now have two nice tools for achieving such a clean separation between transaction phases: 1) Logical log records 2) Dirty cache block cloning The latter takes advantage of logical log records to limit the effect of a high level operation to just the leaf blocks affected. Where two operations that are supposed to be in two different transaction phases affect the same leaf block(s) then the block(s) will be cloned to achieve proper physical separation. So I can always force a clean boundary between phases without generating a lot of dirty blocks to do it, an important property when it comes to anti-deadlock memory provisions. The payoff for a clean frontend/backend separation is, as you say, a clean design. The art in this is to accomplish the separation without any throughput compromise, which I am willing to believe you have done in Hammer, and which I now wish to achieve by a somewhat different means in Tux3. > Once the operation concludes > we are left with dirty meta-data buffers which for all intents and > purposes have transfered the in-memory cached B-Tree records (held > as separate entities in a red-black tree in-memory) into the media > view of the B-Tree (having performed the actual B-Tree operations > necessary to insert or delete the record), and the inode lock is > then released. The flusher flushes each inode on the flush list in > turn, with certain long-lived inodes (e.g. truncation of a 1TB > file to 0) only getting partial goodness and then being moved to > the next flush list. The inode lock is held for each individual > inode as it is being flushed, one inode at a time. I need a means of remembering which blocks are members of which phases, and an rbtree may be indeed better than the list I was thinking of. A hash table is probably ok too. I also have to remember the logical update records that were staged when the high level operation took place in order to avoid having the btree leaf changes climb up into the index nodes copy on write style, and to avoid constant changes to the free map btree. I can coopt some bits of the buffer flags to keep track of which phase a given dirty buffer belongs to. To avoid having to attach buffers to every dirty page cache page, I might use some page flags in that case. Or that state can be in a secondary bookkeeping structure like the rbtree or hash. Details to be settled before the kernel coding starts. > The actual dirty meta-data buffers are flushed later on, without > any inode lock held and without having any blocking effect on the > frontend. Roger, same here. The inode locks and things like rename lock are only held during the course of the high level operation. Tux3 does not have to worry about that at all, the VFS takes care of it nicely. Tux3 only has to lock the data structures that keep track of the phases (spinlocks) and the btrees embedded in the buffer and page cache (possibly hashed mutexes). Note: when I write VFS I mean "virtual filesystem switch" and when you write VFS you mean "virtual filesystem", which in Linux is just called a filesystem. > This media-write sequence occurs at the end, after all > inodes have been processed. UNDO finishes writing, we wait for > I/O to complete, we update the volume header, we wait for I/O > to complete, then we asynchronously flush the dirty media buffers > (and don't wait for that I/O to complete so the writes can be merged > with the beginning of the next flush group). Here, Tux3 (and Hammer) can always set up a very long forward log chain, only updating the header once per multi-transaction commit phase. > That is what I meant when I said the frontend was disconnected > from the backend. The worst the frontend ever does is issue reads > on media meta-data as part of any merged memory-media B-Tree searches > it must perform to locate information. Any modifying operations made > by the frontend will create records in the in-memory red-black tree, > but will *NOT* do any modifying oprations on meta-data buffers > representing the on-media B-Tree. Frontend lookups are resolved by > doing a merged-search with two cursors (one on the in-memory red-black > tree, and one on the on-media B-Tree). It is beautiful, now that I understand it. > I considered doing the interlock at the record level but hit > some hicups in that one of the cursors may have already passed the > memory record being committed but the other cursor (the media view) > may not have yet passed the B-Tree index where that record would be > written to. So if a commit were to occur right then, without a > governing inode lock, the merged search might see a duplicate record > (or no record in either view in the case of a deletion). Interlock at the micro-transaction level really sucks. There are way too many interlocks. It is much nicer just to have the phase boundaries, and not care about the dependencies between micro transactions withing the transaction phase, knowing that they will all be satisfied because the entire phase will be committed atomically. Now here is another way of looking at the difference between undo and redo. Each of my redo log entries leaves the filesystem meta-metadata in a consistent state in the sense that all btree structure will be consistent. But the logical metadata may be inconsistent, for example, a directory may have appear to have been deleted on disk while some deletions of files it contained were not persistently recorded. I think one can take advantage of such partial, low level consistency in order to do a rapid scan and repair, preserving as much as possible of what the filesystem was doing before the crash of a busy system with lots of transactions in flight. > :> I just run the UNDOs backwards, which allows me to cache the in-flush > :> undo areas with a fixed set of small throw-away tracking structures > :> in system memory. If a structure gets thrown away it just means an > :> extranious UNDO record will be appended to the UNDO FIFO. > : > :I think I see it, but I have my doubts because you have to block > :transactions waiting for the up to date copy of the btree data to land > :on disk. Either that, or you may give userspace the impression that > :some transaction has gotten onto stable storage when that is not the > :case. > > Userland doesn't care whether the actual transactions have made it > to disk or not unless it actually calls fsync(). If it does then > it will block through the necessary flush cycles. fsync() on > HAMMER is pretty slow as it will likely be syncing a lot more > in the flush group then just the requested inode. Yes, this was the point that finally made me understand how your undo approach works. Sorry for making you explain it so many times ;-) > If userland does not call fsync() (or 'sync', etc) then it doesn't > block at all, except as described above during the modify-buffer > phase of the B-Tree updates (on an inode-by-inode basis so actual > blocking is virtually non-existant). It is often possible to copy > an entire directory tree without initiating a backend flush. I presume that if one userland process is blocked on fsync, the others are still able to continue dumping their operations into cache? > Media B-Tree elements, inclusive of any held in dirty meta-data buffers, > are locked on a B-Tree node by B-Tree node basis. These are very > short lived locks and have nothing to do with the actual flushing > of the dirty meta-data buffers to the physical media. The frontend > always gets shared locks, the backend always gets exclusive locks. > Some blocking between the frontend and backend occurs but not a whole > lot. The backend flusher actually blocks against itself more often > (it runs multiple threads), due to the locality of reference within > the B-Tree for nearby inodes. There is no need for multiple flushing threads in Tux3, I will use bio submissions instead, which are completely asynchronous, returning completion status in interrupt mode as they do. > I'd recommend dropping support for NFSv2. It is not really worth > supporting any more. Does it even support 64 bit inodes? (I don't > remember), or 64 bit file offsets? NFSv2 is garbage. It's a thought. But anyway, I don't see a big downside to the PHTree approach. It might even end up faster than HTree (which does not maintain dirent stability) by virtue of the slightly smaller (5-10%) cache footprint. If classic HTree turns out to be faster, I could offer both as an option. The code that differs between them should be about 10-20%. > You should be able to use 63 bits of the cookie, I don't know why > you wouldn't use the high bit of the lsb 32 bit part. There is no > requirement that that bit be 0. In fact, the RFC says the cookie is > a 64 bit unsigned integer and you should be able to use all 64 bits. > If linux is not allowing all 64 bits to be used then it's a serious > bug in linux. The cookies are supposed to be opaque, just like the > file handle. You are right about the cookie 64 bitness, I don't know where I picked up that lore, mea culpa. > :> which means that records in the B-Tree aren't even close to being > :> in any sort of lexical order. I will probably wind up changing > :> the hash calculation a bit to get some lexical sorting information > :> stuffed into it... should be doable since the hash keys are 64 bits. > : > :Yes, I noticed that. Check out dx_hack_hash: > : > : http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/fs/ext3/hash.c#L38 > :.. > :It distributed hashes of ascii strings very well for me, with few > :funnels. It way outperformed some popular hashes like TEA. Ted's cut > :down crytographic hash is yet more uniform but costs much more CPU. > > I like :-). I'll probably end up with multiple directory hash formats > too before this is over. Fortunately I have room in the inode to > specify the format the directory will use. I overlooked the part where you mentioned giving some lexical ordering to your hash to get better locality between lexically related records. I am not sure this is a win, it will require a lot of analysis, which starts to drift off the critical path for me, since crystallizing the transaction design is really a lot more important. An inverse idea to making the hash nonrandom is to make the goal inum depend somewhat on the name hash for a newly created file. So relatively fewer PHTree index leaf nodes, which contain [hash, dirblock] pairs, will be have to be reloaded into cache during deletion of a massive directory in physical dirent order. The goal inum has to depend mostly on the physical dirent location though, to avoid thrashing the inode table during mass creation, deletion or stat. I am not sure whether there is a gain to be had vs keeping the random hash completely random, which does deliver a benefit in balancing the tree better. > :Yes, I have thought about it a little more, and I imagine something > :like a small array of dirty bits that climb up the btree with the help > :of logical logging, where each bit means that there is something > :interesting for some corresponding replication target to look at, > :somewhere down in the subtree. > > You can do with one transaction id / mtime field in the B-Tree node > header. Strictly speaking you don't need a per-element field on > internal elements, though I have one there too because I had the space. Yes, the common case of replication is only against the "current" version and that optimization will work well there. There might be multiple current versions in a Tux3 volume, In the case of many virtual servers running off the same filesystem volume, each with its own writable snapshot, and each replicating thatsnapshot somewhere, I need to be able to track dirtiness in multiple versions simultaneously. Or just not worry to much about that and sweep through the entire inode table btree. Multiple dirty block streams could be output in one pass in this case. > :> Not for directory entries. I'm pretty sure Linux uses a throw-away > :> namecache just like we do, so unless it has a mechanic for flush > :> sequencing of dirty namecache entries you need in-VFS structural > :> support for caching the operations. > : > :The Linux dentry cache actually implements proper namespace semantics > :all by itself without needing to flush anything, which is what Ramfs > :is. Ramfs just lives entirely in cache without even being backed by > :a ramdisk, until the power goes off. > > That's cool. You will want to flush it since you are backing the > system caches with a real filesystem, and you would probably need > a tracking structure of some sort anyway to deal with the > directory entry create/delete dependancies. It flushes itself out when it is not pinned by something like ramfs for example, which has no backing store. The VFS takes care of the create/delete dependencies. > :> Jeeze, BSD's have been doing that forever. That's what VOP_BMAP is > :> used for. I'm a little surprised that Linux doesn't do that yet. > :> I'll expand on that down further. > : > :Either you forgot to expand or I missed it. I am interested. > > Oops, I think I did forget to expand that. Basically buffer cache > buffers under kernel management are logically indexed. Each vnode > has its own tree of buffer cache buffers. > > When the kernel wants to flush dirty buffers or cluster adjacent > buffers together for I/O it uses VOP_BMAP to map the logical offsets > to physical offsets (so kernel-supported clustering is thus based > on physical offsets). The VFS is responsible for implementing the BMAP > operation. It is the BMAP operation which assigns actual physical > media blocks to a high level logically mapped filesystem buffer. > > The mapping is persistent as long as the buffer cache buffer exists. > Of course clean buffers can be thrown away by the kernel so at some > point in the future the kernel might re-request the VOP_BMAP for > that logical offset. > > The VFS can choose when to implement the physical mapping. It can > tell the kernel to take a hike and do it when the kernel actually > writes the buffer out (being a logical buffer the flush goes through > the VFS via VOP_STRATEGY), it can do it early on when the actual > read() or write() opration occurs, etc. That sounds a lot like the <fs>_get_block interface the Linux filesystem block IO library uses to map buffers to physical addresses. That is just a set of convenience functions the filesystem can use to satisfy such vmm calls into the filesystem as ->writepage. The filesystem just has to supply its own _get_block method. Between that and mount options parsing, there is not a whole lot more you have to do to bring up a filesystem on Linux, that is, if you are not really worried about performance. Mapping each block with a separate call into the filesystem actually sucks pretty badly CPU wise, but this is masked by slow spindle speeds that stretch the CPU out over time. That masking disappears with some of the high performance storage hardware coming out now and the wasteful effects of that interface start to show. > BSD buffer cache buffers are backed by vm_page_array[], an array of > VM pages, and the base of the buffer does not have to be page-aligned. > So variable page sizes are not necessary. vm_page_array... is that like a Linux page cache? (aka mapping, aka address_space) Buffers on page cache pages do not have to be page-aligned in Linux, or rather, each page in Linux is either fully covered with buffers or has none at all. What sucks about all this is having buffers _and_ page descriptors, where the two perform very similar functions and duplicate a lot of state, causing endless bugs due to skew in the cached state between the two kinds of objects. > :Indeed, I need to handle that. Since I do want to stick with a simple > :linear sequence of logical log commits for now, I do not want to leave > :any of them sitting around for a log time. One easy thing to do is to > :put any orphans aside at rollup time, in an unversioned orphan file, say. > > I had a little time to think about it. I think the easiest solution > is, if you are forced to flush an open inode with 0 links, to actually > record the open state in a separate B-Tree so upon remounting after a > crash you can simply scan that B-Tree and finish off the inodes > you find in that tree. That way it would not have to be held in the > log at all as anything other then a transitory entry. That is exactly the idea. The question is whether to have a dedicated btree or to use a file which is also a btree. The orphan file (if that turns out to be the mechanism) would implement a simple linked list of orphans, and the transitory log entries would reduce the frequency the orphan file has to be updated, allowing it to be updated in batches. This post required a week to respond to because I thought I should also post some code in parallel. Regards, Daniel
From: Matthew Dillon Subject: Re: [Tux3] Comparison to Hammer fs design Date: Aug 5, 10:33 am 2008 :The payoff for a clean frontend/backend separation is, as you say, a :clean design. The art in this is to accomplish the separation without :any throughput compromise, which I am willing to believe you have done :in Hammer, and which I now wish to achieve by a somewhat different :means in Tux3. It's been a major advantage for HAMMER. The major improvement I made to make sure throughput was retained was to create a space reservation subsystem based around a structure called hammer_reserve. This allows the frontend to reserve space on the media without having to modify any meta-data or commit to the allocation. This in turn allows the frontend to reserve space for bulk data writes and actually perform the writes directly to the media, and then queue just the meta-data, along with the reservation to the backend. The backend then finalizes the reservation (doing the actual allocation) at the same time it lays down the related meta-data. The result is that you get nice pipelining when writing lots of bulk data. It is possible to literally write gigabytes of bulk data before the backend needs to flush any of the related meta-data. Separating the two out also made testing the filesystem a lot easier. Most of the frontend could be tested without a working backend. :Note: when I write VFS I mean "virtual filesystem switch" and when you :write VFS you mean "virtual filesystem", which in Linux is just called :a filesystem. Yah :-) :Now here is another way of looking at the difference between undo and :redo. Each of my redo log entries leaves the filesystem meta-metadata :in a consistent state in the sense that all btree structure will be :consistent. But the logical metadata may be inconsistent, for example, :a directory may have appear to have been deleted on disk while some :deletions of files it contained were not persistently recorded. I think :one can take advantage of such partial, low level consistency in order :to do a rapid scan and repair, preserving as much as possible of what :the filesystem was doing before the crash of a busy system with lots of :transactions in flight. Yah, I hit up against similar issues which I resolved with the 'flush group' abstraction, which groups together dependant operations. If the group would become too large (e.g. you have an arbitrarily long chain of dependancies), it can break the chain by flushing out the inode at the break point with an adjusted nlinks count. That way each flush group can be made logically consistent. :... :> phase of the B-Tree updates (on an inode-by-inode basis so actual :> blocking is virtually non-existant). It is often possible to copy :> an entire directory tree without initiating a backend flush. : :I presume that if one userland process is blocked on fsync, the others :are still able to continue dumping their operations into cache? Yah. fsync() just queues the inode to the flusher and the flusher bangs away at it until its all out, then wakes up the waiting fsync code. :... :> Some blocking between the frontend and backend occurs but not a whole :> lot. The backend flusher actually blocks against itself more often :> (it runs multiple threads), due to the locality of reference within :> the B-Tree for nearby inodes. : :There is no need for multiple flushing threads in Tux3, I will use bio :submissions instead, which are completely asynchronous, returning :completion status in interrupt mode as they do. The issue HAMMER had is that the flusher has to perform multiple media B-Tree operations, which can block on reads. I wound up going with multiple threads to parallelize the blocked reads. :It's a thought. But anyway, I don't see a big downside to the PHTree :approach. It might even end up faster than HTree (which does not :maintain dirent stability) by virtue of the slightly smaller (5-10%) :cache footprint. If classic HTree turns out to be faster, I could offer :both as an option. The code that differs between them should be about :10-20%. None of the standards require directory stability within the cached dirent buffer. That is, if libc caches just the base seek point for a set of directory entries, there is no requirement that the positioning of elements beyond the first element at that base seek point be consistent. The standards only require stability at explicit seek points. In fact, the standards explicitly say that trying to seek around inside a directory after performing a create or delete has undefined operation. The only thing you can count on is a linear iteration of the directory. :> You should be able to use 63 bits of the cookie, I don't know why :> you wouldn't use the high bit of the lsb 32 bit part. There is no :> requirement that that bit be 0. In fact, the RFC says the cookie is :> a 64 bit unsigned integer and you should be able to use all 64 bits. :> If linux is not allowing all 64 bits to be used then it's a serious :> bug in linux. The cookies are supposed to be opaque, just like the :> file handle. : :You are right about the cookie 64 bitness, I don't know where I picked :up that lore, mea culpa. I seem to recall linux having issues with those bits too, it did strike a note. I don't remember from where but it was a long time ago. Frankly, any linuxisms that don't allow the use of at least 63 bits need to be fixed in linux. :> I like :-). I'll probably end up with multiple directory hash formats :> too before this is over. Fortunately I have room in the inode to :> specify the format the directory will use. : :I overlooked the part where you mentioned giving some lexical ordering :to your hash to get better locality between lexically related records. :I am not sure this is a win, it will require a lot of analysis, which :starts to drift off the critical path for me, since crystallizing the :transaction design is really a lot more important. : :An inverse idea to making the hash nonrandom is to make the goal inum :depend somewhat on the name hash for a newly created file. So :relatively fewer PHTree index leaf nodes, which contain [hash, dirblock] :pairs, will be have to be reloaded into cache during deletion of a :massive directory in physical dirent order. The goal inum has to depend :mostly on the physical dirent location though, to avoid thrashing the :inode table during mass creation, deletion or stat. I am not sure :whether there is a gain to be had vs keeping the random hash completely :random, which does deliver a benefit in balancing the tree better. That's an interesting idea. :> BSD buffer cache buffers are backed by vm_page_array[], an array of :> VM pages, and the base of the buffer does not have to be page-aligned. :> So variable page sizes are not necessary. : :vm_page_array... is that like a Linux page cache? (aka mapping, aka :address_space) Buffers on page cache pages do not have to be Yes. :page-aligned in Linux, or rather, each page in Linux is either fully :covered with buffers or has none at all. What sucks about all this is :having buffers _and_ page descriptors, where the two perform very :similar functions and duplicate a lot of state, causing endless bugs :due to skew in the cached state between the two kinds of objects. BSD buffers can overlap pages. e.g. you can have 4K pages and 6K buffers, so every other page is overlapped by two buffers. FreeBSD and DragonFly take care of keeping track what portions of the VM pages are actually dirty (and it gets more complicated if the pages are mapped into userland). :This post required a week to respond to because I thought I should also :post some code in parallel. : :Regards, : :Daniel Don't wear yourself out! This conversation has been fun and I've gotten a lot of good ideas out of it, plus a much better understanding of REDO. -Matt Matthew Dillon <dillon@backplane.com>
From: Daniel Phillips Subject: [Tux3] Two kinds of atomic commit Date: Jul 27, 10:46 pm 2008 Here I describe two slightly different methods that Tux3 will use to implement atomic commit of data and metadata. Both methods combine logical and physical forward logging to perform atomic commits efficiently. The discussion below assumes we are updating a btree leaf, for example to make room for more inode data or to add pointers to a file index. The same techniques apply equally well to all structures in the filesystem. 1) The Update a Clone Method Instead of directly modifying a disk block corresponding to a btree leaf, we allocate a new block for the leaf and copy the contents of the original block to the new block, only in the buffer cache (no copy is performed on disk). We can now alter the new block at will and flush it out to disk without affecting the on-disk btree at all. But have not yet linked the new block into the btree. We could accomplish that by performing a similar clone update recursively up to the root of the tree, which creates a new tree root. The whole chain of new blocks would then be flushed out to disk and a pointer to the new root stored at some predictable location so it can be found later. This is the "phase tree" method that I invented for Tux2, and is commonly called "copy on write" these days. It could also be called the "sue me" method because Netapp likes to sue those such as Sun who implement it. Fortunately, there is a better way to do it that I invented recently and which I have never heard of anyone using before. We modify only the leaf node of the btree by cloning and record the pointer to the clone only in the cached btree index block, not on disk. To be able to reconstruct the cached version of the index node after a crash, we log a logical change record to the disk that says "write this leaf into that btree index node". We make whatever changes we need to the clone of the leaf node, then construct a two block transaction consisting of the clone and a commit block. The commit block points at the new leaf node and also carries the logical update record that describes how to modify the parent index node recorded on disk to point at the new leaf. If the commit block can be allocated immediately adjacent to the clone then we can construct a single bio to transfer the clone and the commit block to disk in one operation. Otherwise we construct two bios. If there is some previous commit transaction that has been staged but not yet committed, then we modify its commit block to point at the location where our new commit block will be stored. Otherwise we have to seek to some known location on the disk to store the location of the new commit. To be able to tell after a crash whether the commit block and its associated data made it to disk, we store a sequence number to distinguish this commit block from a possible similar commit that might have landed accidentally at the same location earlier, and store a hash of the two blocks in the commit block. Now the transaction is ready to go out to disk. But if the disk is busy with previous traffic (we hope it is) then we just stage the transaction for commit, and commit the predecessor transaction instead. This gives a chance for any following transaction to extend the forward log chain instead of having to start a new chain, which would need an extra seek and transfer. The original version of the leaf node on disk is not yet freeable, because it may be needed after a crash (together with the logical update record) to reconstruct the cached btree node image. Only after we "roll up" the logical update log entry by atomically updating the parent node may we free the original leaf block and the commit block that recorded the logical update. Such a rollup happen periodically, otherwise we would end up with an unreasonably long chain of logical updates needing to be replayed on remount to reconstruct the btree image in cache. 2) The Update in Place Method An alternate method of atomically updating our btree leaf node is to write it to disk twice: the first time along with a commit block that says we intend to overwrite the leaf node with the data part of the transaction, and the second write does the actual overwrite. If we crash before completing the second write, replay will load the image of the btree block from the commit transaction into cache and we are ready to continue where we left off. Like the clone method, the update in place method constructs a transaction, tries to link it into an existing forward log chain, or falls back to writing the commit block location to a known location if it has to, then stages the transaction for commit, committing the predecessor transaction in the forward chain if there is one. The overwrite transfer can be submitted as soon as the commit transaction has completed, which might consist of one or two bio transfers depending on whether it was possible to allocate the transaction contiguously or not. It is possible that the updated leaf block may be needed as the base for applying logical updates to reconstruct the cached version of the leaf as a consequence of some future transaction, but the future transaction must be able to rely on a known state of the leaf if it wants to log logical changes against it. Therefore the future transaction may have to wait on the update in place transaction to complete as well, to guarantee that the desired leaf state can be reconstructed. Any future update in place of the same block has to wait for the overwrite transfer to complete before submitting its own overwrite transfer, otherwise the second overwrite may race ahead of the first creating a stale disk block. The whole update in place transaction can be freed as soon as the overwrite transfer completes. Comparison of the methods Both methods are very efficient ways to achieve atomic commit. The update in place method has the advantage of not perturbing the original disk layout, possibly reducing fragmentation. The update clone method has the advantage of one less write operation. Tux3 will provide a mount option to specify which method to use. A related technique is atomic append. A whole range of data blocks can be written out along with a commit block in a single bio transfer, or a small number of bio transfers in one transaction, along with a logical record in the commit block to say which file the blocks are to be appended, in the form of a logical update in the commit block which will eventually result in the addition of one or more extents to the file index, update of the size and mtime, and removal of the new data blocks from the free map. Thus, a file write of considerable size can be atomically committed with a single bio transfer. Regards, Daniel _______________________________________________ Tux3 mailing list Tux3@tux3.org http://tux3.org/cgi-bin/mailman/listinfo/tux3
From: Matthew Dillon Subject: Re: [Tux3] Two kinds of atomic commit Date: Jul 28, 9:58 am 2008 :1) The Update a Clone Method : :2) The Update in Place Method : :Daniel Hmm. Those seem fairly complex. How would you deal with incidental operations on the B-Tree such as doing a split? A single insert could wind up requiring a split of every internal node on the way down to the leaf. I guess you could clone the blocks, but there are a ton of parent and child linkages that have to be changed when you split a node. It sounds pretty expensive. Maybe implementing UNDOs for incidental B-Tree operations is the ticket. Incidental meaning those operations (such as splits) which lead up to an insertion or deletion but do not actually perform the insertion or deletion. On crash recovery you would undo partially completed B-Tree operations and then REDO the related logical operations. Having to allocate new blocks for B-Tree deletions could create a big issue when the filesystem becomes full or near-full. -Matt Matthew Dillon <dillon@backplane.com> _______________________________________________ Tux3 mailing list Tux3@tux3.org http://tux3.org/cgi-bin/mailman/listinfo/tux3
From: Daniel Phillips Subject: Re: [Tux3] Two kinds of atomic commit Date: Jul 28, 12:52 pm 2008 On Monday 28 July 2008 09:58, Matthew Dillon wrote: > :1) The Update a Clone Method > : > :2) The Update in Place Method > : > :Daniel > > Hmm. Those seem fairly complex. Then I need to rewrite the post so it seems as simple as it is :-) > How would you deal with incidental > operations on the B-Tree such as doing a split? A single insert > could wind up requiring a split of every internal node on the way > down to the leaf. I guess you could clone the blocks, but there are > a ton of parent and child linkages that have to be changed when you > split a node. It sounds pretty expensive. In this case a log transaction is created containing all of the split nodes as physical updates and possibly some merged nodes from the free tree. Btree splitting can always be committed atomically and independently of any other activity on the filesystem. It is nicely bounded by the btree depth. Allocations that do not require splitting the free tree are logged logically, leaving the affected free tree blocks dirty in cache. So the allocation updates come "for free", being tucked into the same commit block that governs the split btree leaves. > Maybe implementing UNDOs for incidental B-Tree operations is the ticket. > Incidental meaning those operations (such as splits) which lead up to > an insertion or deletion but do not actually perform the insertion > or deletion. On crash recovery you would undo partially completed > B-Tree operations and then REDO the related logical operations. I now see how UNDO works, thankyou for your patient explanations. The point I overlooked is, fsync (and friends) is indeed the only barrier you have to worry about. Still, I think it is good to try to get as much as possible of what was going on in a bit burst of activity with no fsync durably onto disk. Redo will clearly leave the filesystem in a consistent state less far back in time than undo. So my general strategy is to log big changes like splits as "physical" full block updates and small ones like allocating an extent as logical edit records in the commit blocks of the physical updates. The complexity you noted above is due to making sure that the on-disk image of a block is always what the logical edit record expects it to be at the time the logical edit is replayed. > Having to allocate new blocks for B-Tree deletions could create a big > issue when the filesystem becomes full or near-full. Yes, allocate-in-free is usually nasty. I need to ensure that there is always a reserve of blocks available on disk that is more than the maximum possible transaction that can be accepted. This simple idea can get really complex, I know. One nice trick to simplify things a little is to have a really generous limit on the maximum number of dirty blocks that are allowed, when the filesystem has lots of free space, and shrink that limit progressively as free space approaches zero. I now see what you were driving at with your point about interlocking namespace transactions, etc. While each VFS transaction can indeed be committed on its own, atomically and independently, to do so would be death for throughput. So it is necessary to batch up a lot of these transactions, and be mindful of the interdependencies between the VFS transactions and the underlying data structures. The rule is, underlying data changes required by any given VFS transaction must never lie in more than one commit. This can be accomplished without actually tracking the physical representation interdependencies between VFS transactions. Instead, just count the dirty cache blocks and send out an atomic commit for the entire set of transactions when some threshold is passed. Now the challenge is to figure out how to avoid stalling during the commit. It is thanks to you, your searching questions and the example of Hammer that I was forced to understand these issues clearly. :-) Note! In Linux, VFS means "Virtual Filesystem Switch", that is, the security, synchronization and abstraction layer that exists above filesystems in Linux. I think it means "Virtual File System" to you, which is just "filesystem" to Linux hackers. Regards, Daniel _______________________________________________ Tux3 mailing list Tux3@tux3.org http://tux3.org/cgi-bin/mailman/listinfo/tux3


My head aches
My head aches by just looking at these posts. These guys are seriously immerged in their work.
It's great to see to people
It's great to see to people on what are essentially "competing" operating systems and "competing" filesystem projects will to share the technical and design process.
Precisely the sort of collaboration that attracted me to OSS in the first place.
It's great to see to people
It's great to see to people on what are essentially "competing" operating systems and "competing" filesystem projects willing to share the technical and design process.
Precisely the sort of collaboration that attracted me to OSS in the first place.