(resent after subscribing to the the kernel@crater) On Friday 25 July 2008 11:53, Matthew Dillon wrote:How about a cross post? Yes, that is the main difference indeed, essentially "log everything" vs "commit" style versioning. The main similarity is the lifespan oriented version control at the btree leaves. Once I get down to the leaf level I binary search on logical address in the case of a file index btree or on version in the case of an inode table block, so this cost is still Log(N) with a small k. For a heavily versioned inode or file region this might sometimes result in an overflow block or two that has to be linearly searched which is not a big problem so long as it is a rare case, which it really ought to be for the kinds of filesystem loads I have seen. A common example of a worst case is /var/log/messages, where the mtime and size are going to change in pretty much every version, so if you have hourly snapshots and hold them for three months it adds up to about 2200 16 byte inode table attributes to record, about 8 inode table leaf blocks. I really do not see that as a problem. If it goes up to 100 leaf blocks to search then that could be a problem. Usually, I will only be interested in the first and last of those blocks, the first holding the rarely changing attributes including the root of the file index btree and the last holding the most recent incarnation of the mtime/size attribute. The penultimate inode table index block tells me how many blocks a given inode lives in because several blocks will have the same inum key. So the lookup algorithm for a massively versioned file becomes: 1) read the first inode table block holding that inum; 2) read the last block with the same inum. The latter operation only needs to consult the immediate parent index block, which is locked in the page cache at that point. It will take a little care and attention to the inode format, especially the inode header, to make sure that I can reliably do that first/last probe optimization for the "head" version, but it does seem worth the effort. For starters I will just do a mindless linear walk of an "overflow" inode and get fancy if that turns out to be a problem. By far the most common case I would think. But check out the versioned pointer algorithms. Surprisingly that just works, which is not exactly obvious: Versioned pointers: a new method of representing snapshots http://lwn.net/Articles/288896/ I was originally planning to keep all versions of a truncate/rewrite file in the same file index, but recently I realized that that is dumb because there will never be any file data shared by a successor version in that case. So the thing to do is just create an entirely new versioned file data attribute for each rewrite, bulking up the inode table entry a little but greatly constraining the search for versions to delete and reducing cache pressure by not loading unrelated version data when traversing a file. ...which happens in Tux3 courtesy of the fact that the entire block containing the dirent will have been versioned, with the new version showing the entry gone. Here is one of two places where I violate my vow to avoid copying an entire block when only one data item in it changes (the other being the atime table). I rely on two things to make this nice: 1) Most dirent changes will be logically logged and only rolled up into the versioned file blocks when there are enough to be reasonably sure that each changed directory block will be hit numerous times in each rollup episode. (Storing the directory blocks in dirent-create order as in PHTree makes this very likely for mass deletes.) 2) When we care about this is usually during a mass delete, where most or all dirents in each directory file block are removed before moving on to the next block. Again, check out the versioned pointer algorithms. You can tell what can be pruned just by consulting the version tree and the create_tids (birth versions) for a particular logical address. Maybe the hang is that you do not organize the btrees by logical address (or inum in the case of the inode table tree). I thought you did but have not read closely enough to be sure. Fair enough. I have an entirely different approach to what you call mirroring and what I call delta replication. (I reserve the term mirroring to mean mindless duplication of physical media writes.) This method proved out well in ddsnap: http://phunq.net/ddtree?p=zumastor/.git;a=tree;h=fc5cb496fff10a2b03034fcf95122f5828149... (Sorry about the massive URL, you can blame Linus for that;-) What I do in ddsnap is compute all the blocks that differ between two versions and apply those to a remote volume already holding the first of the two versions, yielding a replica of the second version that is logically but not physically identical. The same idea works for a versioned filesystem: compute all the leaf data that differs between two versions, per inum, and apply the resulting delta to the corresponding inums in the remote replica. The main difference vs a ddsnap volume delta is that not all of the changes are physical blocks, there are also changed inode attributes, so the delta stream format has to be elaborated accordingly. I intend to log insertions and deletions logically, which keeps each down to a few bytes until a btree rollup episode comes along to perform updating of the btree nodes in bulk. I am pretty sure this will work for you as well, and you might want to check out that forward logging trick. That reminds me, I was concerned about the idea of UNDO records vs REDO. I hope I have this right: you delay acknowledging any write transaction to the submitter until log commit has progressed beyond the associated UNDO records. Otherwise, if you acknowledge, crash, and prune all UNDO changes, then you would end up with applications believing that they had got things onto stable storage and be wrong about that. I have no doubt you did the right thing there, but it is not obvious from your design documentation. Yes, I mostly have 16 byte elements and am working on getting most of them down to 12 or 8. I use two-stage lookup, or four stage if you count searches within btree blocks. This makes the search depth smaller in the case of small files, and in the case of really huge files it adds depth exactly as appropriate. The index blocks end up cached in the page cache (the buffer cache is just a flavor of page cache in recent Linux) so there is little repeated descending through the btree indices. Instead, most of the probing is through the scarily fast radix tree code, which has been lovingly optimized over the years to avoid cache line misses and SMP lock contention. I also received a proposal for a "fat" btree index from a collaborator (Maciej) that included the file offset but I really did not like the... fattening. A two level btree yields less metadata overall which in my estimation is more important than saving some bree probes. The way things work these days, falling down from level 1 cache to level 2 or from level 2 to level 3 costs much more than executing some extra CPU instructions. So optimization strategy inexorably shifts away from minimizing instructions towards minimizing cache misses. Fortunately, we get that for free on Linux, courtesy of the page cache radix trees :-) I might eventually add some explicit cursor caching, but various artists over the years have noticed that it does not make as much difference as you might think. Indeed. But Linux is braindamaged about large block size, so there is very strong motivation to stay within physical page size for the immediate future. Perhaps if I get around to a certain hack that has been perenially delayed, that situation will improve: "Variable sized page objects" http://lwn.net/Articles/37795/ Topological bloat? Ah, that is a very nice benefit of Tux3-style forward logging I forgot to mention in the original post: transaction size is limited only by available free space on the volume, because log commit blocks can be anywhere. Last night I dreamed up a log commit block variant where the associated transaction data blocks can be physically discontiguous, to make this assertion come true, shortly after reading Stephen Tweedie's horror stories about what you have to do if you must work around not being able to put an entire, consistent transaction onto stable media in one go: http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html Most of the nasties he mentions just vanish if: a) File data is always committed atomically b) All commit transactions are full transactions He did remind me (via time travel from year 2000) of some details I should write into the design explicitly, for example, logging orphan inodes that are unlinked while open, so they can be deleted on replay after a crash. Another nice application of forward logging, which avoids the seek-happy linked list through the inode table that Ext3 does. Good point. I expect Tux3 will eventually have a reblocker (aka defragger). There are some really nice things you can do, like: 1) Set a new version so logical changes cease for the parent version. 2) We want to bud off a given directory tree into its own volume, so start by deleting the subtree in the current version. If any link counts in the directory tree remain nonzero in the current version then there are hard links into the subtree, so fail now and drop the new version. 3) Reblock a given region of the inode table tree and all the files in it into one physically contiguous region of blocks 4) Add a free tree and other once-per volume structures to the new home region. 5) The new region is now entirely self contained and even keeps its version history. At the volume manager level, map it to a new, sparse volume that just has a superblock in the zeroth extent and the new, mini filesystem at some higher logical address. Remap the region in the original volume to empty space and add the empty space to the free tree. 6) Optionally reblock the newly budded filesystem to the base of the new volume so utilties that do not not play well with sparse volumes do not do silly things. Cavalier was a poor choice of words, the post was full of typos as well so I am not proud of it. You are solving a somewhat different problem and you have code out now, which is a huge achievement. Still I think you can iteratively improve your design using some of the techniques I have stumbled upon. There are probably some nice tricks I can get from your code base too once I delve into it. Thankyou very much. I think you are on the right track too, which you have a rather concrete way of proving. The big ahas! that eliminated much of the complexity in the Tux3 design were: * Forward logging - just say no to incomplete transactions * Version weaving - just say no to recursive copy on write Essentially I have been designing Tux3 for ten years now and working seriously on the simplifying elements for the last three years or so, either entirely on paper or in related work like ddsnap and LVM3. One of the nice things about moving on from design to implementation of Tux3 is that I can now background the LVM3 design process and see what Tux3 really wants from it. I am determined to match every checkbox volume management feature of ZFS as efficiently or more efficiently, without violating the traditional layering between filesystem and block device, and without making LVM3 a Tux3-private invention. Hopefully not too much later. I find this dialog very fruitful. I just wish such dialog would occur more often at the design/development stage in Linux and other open source work instead of each group obsessively ignoring all "competing" designs and putting huge energy into chatting away about the numerous bugs that arise from rushing their design or ignoring the teachings of history. Regards, Daniel
| Justin C. Sherrill | Re: dragonflybsd.org website link? |
| David Woodhouse | Re: -mm merge plans for 2.6.23 |
| Greg Kroah-Hartman | [PATCH 002/196] Chinese: rephrase English introduction in HOWTO |
| Eric Sandeen | Re: [RFC] Heads up on sys_fallocate() |
git: | |
| David Miller | [GIT]: Networking |
| Gerrit Renker | [PATCH 15/37] dccp: Set per-connection CCIDs via socket options |
| Patrick McHardy | [NET_SCHED 01/15]: sch_atm: fix format string warning |
