linSubscribed now, everything should be OK. On Friday 25 July 2008 19:02, Matthew Dillon wrote:Yes. Writes tend to be highly parallel in Linux because they are mainly driven by the VMM attempting to clean cache dirtied by active writers, who generally do not wait for syncing. So this will work really well for buffered IO, which is most of what goes on in Linux. I have not thought much about how well this works for O_SYNC or O_DIRECT from a single process. I might have to do it slightly differently to avoid performance artifacts there, for example, guess where the next few direct writes are going to land based on where the most recent ones did and commit a block that says "the next few commit blocks will be found here, and here, and here...". When a forward commit block is actually written it contains a sequence number and a hash of its transaction in order to know whether the commit block write ever completed. This introduces a risk that data overwritten by the commit block might contain the same hash and same sequence number in the same position, causing corruption on replay. The chance of this happening is inversely related to the size of the hash times the chance of colliding with the same sequence number in random data times the chance of of rebooting randomly. So the risk can be set arbitrarily small by selecting the size of the hash, and using a good hash. (Incidentally, TEA was not very good when I tested it in the course of developing dx_hack_hash for HTree.) Note: I am well aware that a debate will ensue about whether there is any such thing as "acceptable risk" in relying on a hash to know if a commit has completed. This occurred in the case of Graydon Hoare's Monotone version control system and continues to this day, but the fact is, the cool modern version control systems such as Git and Mercurial now rely very successfully on such hashes. Nonetheless, the debate will keep going, possibly as FUD from parties who just plain want to use some other filesystem for their own reasons. To quell that definitively I need a mount option that avoids all such commit risk, perhaps by providing modest sized journal areas salted throughout the volume whose sole purpose is to record log commit blocks, which then are not forward. Only slightly less efficient than forward logging and better than journalling, which has to seek far away to the journal and has to provide journal space for the biggest possible journal transaction as opposed to the most commit blocks needed for the largest possible VFS transaction (probably one). Actually, the btree node images are kept fully up to date in the page cache which is the only way the high level filesystem code accesses them. They do not reflect exactly what is on the disk, but they do reflect exactly what would be on disk if all the logs were fully rolled up ("replay"). The only operations on forward logs are: 1) Write them 2) Roll them up into the target objects 3) Wait on rollup completion The rollup operation is also used for replay after a crash. A forward log that carries the edits to some dirty cache block pins that dirty block in memory and must be rolled up into a physical log before the cache block can be flushed to disk. Fortunately, such a rollup requires only a predictable amount of memory: space to load enough of the free tree to allocate space for the rollup log, enough available cache blocks to probe the btrees involved, and a few cache blocks to set up the physical log transaction. It is the responsibility of the transaction manager to ensure that sufficient memory to complete the transaction is available before initiating it, otherwise deadlock may occur in the block writeout path. (This requirement is the same as for any other transaction scheme). One traditional nasty case that becomes really nice with logical forward logging is truncate of a gigantic file. We just need to commit a logical update like ['resize', inum, 0] then the inode data truncate can proceed as convenient. Another is orphan inode handling where an open file has been completely unlinked, in which case we log the logical change ['free', inum] then proceed with the actual delete when the file is closed or when the log is replayed after a surprise reboot. Logical log replay is not idempotent, so special care has to be taken on replay to ensure that specified changes have not already been applied to the target object. I choose not to go the traditional route of providing special case tests for "already applied" which get really ugly or unreliable when there are lot of stacked changes. Instead I introduce the rule that a logical change can only be applied to a known good version of the target object, which promise is fullfilled via the physical logging layer. For example, on replay, if we have some btree index on disk for which some logical changes are outstanding then first we find the most recent physically logged version of the index block and read it into cache, then apply the logical changes to it there. Where interdependencies exist between updates, for example the free tree should be updated to reflect a block freed by merging two btree nodes, the entire collection of logical and physical changes has to be replayed in topologically sorted order, the details of which I have not thought much about other than to notice it is always possible. When replay is completed, we have a number of dirty cache blocks which are identical to the unflushed cache blocks at the time of a crash, and we have not yet flushed any of those to disk. (I suppose this gets interesting and probably does need some paranoid flushing logic in replay to handle the bizarre case where a user replays on a smaller memory configuration than they crashed on.) The thing is, replay returns the filesystem to the logical state it was in when the crash happened. This is a detail that journalling filesystem authors tend to overlook: actually flushing out the result of the replay is pointless and only obscures the essential logic. Think about a crash during replay, what good has the flush done? I do not see why this example cannot be logically logged in pieces: ['new', inum_a, mode etc] ['link', inum_parent, inum_a, "a"] ['new', inum_b, mode etc] ['link' inum_a, inum_b, "b"] ['new', inum_c, mode etc] ['link' inum_a_b, inum_c, "c"] ['new', inum_x, mode etc] ['link' inum_a_b, inum_x, "x"] ['new', inum_d, mode etc] ['link' inum_a_b_c, inum_d, "d"] Logical updates on one line are in the same logical commit. Logical allocations of blocks to record the possibly split btree leaves and new allocations omitted for clarity. The omitted logical updates are bounded by the depth of the btrees. To keep things simple, the logical log format should be such that it is impossible to overflow one commit block with the updates required to represent a single vfs level transaction. I suspect there may be some terminology skew re the term "transaction". Tux3 understands this as "VFS transaction", which does not include trying to make an entire write (2) call atomic for example, but only such things as allocate+write for a single page cache page or allocate+link for a sys_link call. Fsync(2) is not a transaction but a barrier that Tux3 is free to realize via multiple VFS transactions, which the Linux VFS now takes care of pretty well now after many years of patching up the logic. Indeed. I think I have identified a number of techniques for stripping away much of that complexity. For example, the governing parent update can be considered complete as soon as the logical 'link' log commit has completed. Correct! That is why there are two levels of logging: logical and physical. The physical logging level takes care of updating the cache images of disk blocks to match what the logical logging level expects before it can apply its changes. These two logging levels are interleaved: where a logical change requires splitting a btree block, the resulting blocks are logged logically, but linked into the parent btree block using a logical update stored in the commit block of the physical log transaction. How cool is that? The physical log transactions are not just throwaway things, they are the actual new data. Only the commit block is discarded, which I suppose will leave a lot of one block holes around the volume, but then I do not have to require that the commit block be immediately adjacent to the body of the transaction, which will allow me to get good value out of such holes. On modern rotating media, strictly linear transfers are not that much more efficient than several discontiguous transfers that all land relatively close to each other. Tux3 does not use undo logging, only redo, so a transaction is complete as soon as there is enough information on durable media to replay the redo records. Yes. Each new commit records the sequence number of the oldest commit that should be replayed. So the train lays down track in front of itself and pulls it up again when the caboose passes. For now, there is just one linear sequence of commits, though I could see elaborating that to support efficient clustering. One messy detail: each forward log transaction is written into free space wherever physically convenient, but we need to be sure that that free space is not allocated for data until log rollup has proceeded past that transaction. One way to do this is to make a special check against the list of log transactions in flight at the point where extent allocation thinks it has discovered a suitable free block, which is the way ddsnap currently implements the idea. I am not sure whether I am going to stick with that method for Tux3 or just update the disk image of the free tree to include the log transaction blocks and somehow avoid logging those particular free tree changes to disk. Hmm, a choice of two ugly but workable methods, but thankfully neither affects the disk image. Yes, that is my expectation. I think everything will perform fine up to a few hundred snapshots without special optimization, which is way beyond current expectations. Most users only recently upgraded to journalling filesystems let alone having any exposure to snapshots at all. Right, when writing the file takes longer than the snapshot interval. For Tux3 with an hourly snapshot schedule that will only be four or five versions of the [mtime, size] attribute to reflect the 4.7 hours write time or so, and just one version of each block pointer. The inode table block is split at a boundary between inodes. An "inode" is broken up into attribute groups arranged so that it makes sense to update or version all the members of an attribute group together. The "standard" attribute group consists of mode, uid, gid, ctime and version, which adds up to 16 bytes. The smallest empty inode (touch foo) is 40 bytes and a file with "foo" in it as immediate data is 64 bytes. This has a standard attribute, a link count attribute, a size/mtime attribute, and an immediate data attribute, all versioned. An inode with heavily versioned attributes might overflow into the next btree leaf as described elsewhere. Free inodes lying between active inodes in the same leaf use two bytes each, a consequence of the inode leaf directory, which has a table at the top of the leaf of two byte pointers giving the offset of each inode in the leaf, stored backwards and growing down towards the inode attributes. (This is a slight evolution of the ddsnap leaf format.) Split the inode in memory, allocating a new buffer; forward log the two new pieces physically out to disk with a logical record in the commit block recording where the two new pointers are to be inserted without actually inserting them until a logical rollup episode is triggered. I can think of three common cases of files that get a lot of historical modifications: * Append log - The first iteration in each new version generates a new size attribute * Truncate/rewrite - The first iteration in each new version generates a new size attribute and either a new file index root or a new immediate data attribute * Database - The first file change in each new version generates a new versioned pointer at the corresponding logical address in the file btree and a new size attribute (just because mtime is bundled together with the size in one attribute group). So the only real proliferation is the size/mtime attributes, which gets back to what I was thinking about providing quick access for the "current" version (whatever that means). Yes, the proposal is to treat truncation to exactly zero as a special case. It is a tiny amount of extra code for what is arguably the most common file rewrite case. Essentially I chose the same strategy, except that I have the file trees descending from the inode table instead of stretching out to the side. I think this gives a more compact tree overall, and since I am using just one generic set of btree operations to handle these two variant btrees, additional code complexity is minimal. There is a versioned link count attribute in the inode. When a link attribute goes to zero for a particular version it is removed from the inode. When there are no more link attributes left, that means no directory block references the inode for any version and the inode may be reused. Note: one slightly bizarre property of this scheme is that an inode can be reused in any version in which its link count is zero, and the data blocks referenced by different versions in the same file index can be completely unrelated. I doubt there is a use for that. Ext2 dirents are 8 bytes + name + round up to 4 bytes, very tough to beat that compactness. We have learned through bitter experience that anything other than an Ext2/UFS style physically stable block of dirents makes it difficult to support NFS telldir cookies accurately because NFS vs gives us only a 31 bit cookie to work with, and that is not enough to store a cursor for, say, a hash order directory traversal. This is the main reason that I have decided to go back to basics for the Tux3 directory format, PHTree, and make it physically stable. In the PHTree directory format lookups are also trivial: the directory btree is keyed by a hash of the name, then each dirent block (typically one) that has a name with that hash is searched linearly. Dirent block pointer/hash pairs are at the btree leaves. A one million entry directory has about 5,000 dirent blocks referenced by about 1000 btree leaf blocks, in turn referenced by three btree index blocks (branching factor of 511 and 75% fullness). These blocks all tend to end up in the page cache for the directory file, so searching seldom references the file index btree. I did some back of the envelope calculations of the number of cache lines that have to be hit for a lookup by Hammer with its fat btree keys and lower branching factor, vs Tux3 with its high branching factor, small keys, small pointers, extra btree level for dirent offset stability and stupidly wasteful linear dirent search. I will not bore the list with the details, but it is obvious that Tux3/PHTree will pass Hammer in lookup speed for some size of directory because of the higher btree fanout. PHTree starts from way behind courtesy of the 32 cache lines that have to be hit on average for the linear search, amounting to more than half the CPU cost of performing a lookup in a million element directory, so the crossover point is somewhere up in the millions of entries. Thus annoyed, I cast about for a better dirent leaf format than the traditional Ext2 directory block, and found one after not too much head scratching. I reuse the same leaf directory format as for inode leaf blocks, but this time the internal offsets are sorted by lexical name order instead of inode number order. The dirents become Ext2 dirents minus the record number format, and minus the padding to four byte alignment, which does not do anything useful. Dirent inum increases to 6 bytes, balanced by saving 1.5 pad bytes on average, so the resulting structure stores about 2% fewer dirents than the Ext2 format against a probable 50% reduction in CPU latency per lookup. A fine trade indeed. The new format is physically stable and suitable for binary searching. When we need to manage space within the leaf for creating or deleting an entry, a version of the leaf directory ordered by offset can be created rapidly from the lexically sorted directory, which occupies at most about six cache lines. The resulting small structure can be cached to take advantage of the fact that most mass creates and deletes keep hitting the same dirent block repeatedly, because Tux3 hands the dirents to getdents in physical dirent order. I concluded that both Hammer and PHTree are exceptionally fast at name resolution. When I have the new dirent block format in place I expect to really hammer Hammer. But then I do not expect Hammer to stand still either ;-) A PHTree directory grows like an append-only file, a block at a time, though every time a new entry is added, mtime has to change, so mtime changes more often than size. They are grouped together in the same attribute, so the distinction is moot. If the directory is modified at least once after every snapshot (quite likely) then the snapshot retention policy governs how many mtime/size attributes are stored in the inode. Say the limit is 1,024 snapshots, then 16K worth of those attributes have to be stored, which once again encourages me to store the "current" mtime specially where it can be retrieved quickly. It would also be a good idea to store the file data attribute in front of the mtime/size attributes so only stat has to worry about the bulky version info, not lookup. Note that there is no such thing as revving an entire inode, only bits and pieces of it. For truly massive number of versions, the plan is to split up the version tree into subtrees, each having a thousand versions or so. For each version subtree (hmm, subversion tree...) there is a separate inode btree carrying only attributes and file data owned by members of that subtree. At most log(subtrees) of those tables need to be accessed by any versioned entity algorithm. (Note the terminology shift from versioned pointers to versioned entities to reflect the fact that the same algorithms work for both pointers and attributes.) I think this approach scales roughly to infinity, or at least to the point where log(versions) goes vertical which is essentially never. It requires a bounded amount of implementation effort, which will probably be deferred for months or years. Incidentally, PHTree directories are pretty much expand-only, which is required in order to maintain physical dirent stability. Compaction is not hard, but it has to be an admin-triggered operation so it will not collide with traversing the directory. It is fine for diving into old snapshots. Lookup is always O(elements) where an element is a versioned pointer or (later) extent, which are very compact at 8 bytes each. Because the truncate/rewrite file update case is so common, you will rarely see more than one version at any given logical address in a file. A directory growing very slowly could end up with about 200 different versions of each of its final few blocks, one for each added entry. It takes on the order of a microsecond to scan through a few hundred versioned pointers to find the one that points at the version in which we are interested, and that only happens the first time the directory block is read into the page cache. After that, each reference to the block costs a couple of lookups in a page cache radix tree. The version lookup algorithm is not completely obvious: we scan through the list of pointers looking for the version label that is nearest the version being accessed on the path from that version to the root of the version tree. This potentially expensive search is accelerated using a bitmap table to know when a version is "on the path" and a pre-computed ord value for each version, to know which of the versions present and "on the path" for a given logical address is furthest from the root version. This notion of furthest from the root on the path from a given version to the root implements data inheritance, which is what yields the compact representation of versioned data, and is also what eliminates the need to explicitly represent the death version. You have discovered this too, in a simpler form. I believe that once you get the aha you will want to add versions of versions to your model. That is really sweet, provided that you are ok with the big keys. It was smart to go with that regular structure, giving you tons of control over everything and get the system up and running early, thus beating the competition. I think you can iterate from there to compress things and be even faster. Though if I were to presume to choose your priorities for you, it would be to make the reblocking optional. So you have to search through a range of TIDs to find the operative one for a particular version, just as I have to do with versioned pointers. Now just throw in a bitmap like I use to take the version inheritance into account and you have versioned pointers, which means snapshots of snapshots and other nice things. You're welcome :-) I think that versioned pointer lookup can be implemented in O(log(E)) where E is the number of versioned pointers (entities) at the same logical address, which would put it on a par with btree indexing, but more compact. I have sketched out an algorithm for this but I have not yet tried to implement it. This is where I do not quite follow you. File contents are never really deleted in subsequent versions, they are just replaced. To truncate a file, just add or update size attribute for the target version and recover any data blocks past the truncate point that belongs only to the target version. Extended attributes and dirents can be truly deleted. To delete an extended attribute that is shared by some other version I will insert a new versioned attribute that says "this attribute is not here" for the current version, which will be inherited by its child versions. Dirent deletion is handled by updating the dirent block, possibly versioning the entire block. Should the entire block ever be deleted by a directory repacking operation (never actually happens on the vast majority of Linux systems) that will be handled as a file truncate. I see that in Hammer, the name is deleted from the btree instead, so fair enough, that is actual deletion of an object. But it could be handled alternatively by adding a "this object is not here" object, which might be enough to get rid of delete_tid entirely and just have create_tid. I plan to scan the whole btree initially, which is what we do in ddsnap and it works fine. But by looking at the mtime attributes in the inode I can see in which versions the file data was changed and thus not have to scan most files. Eventually, building in some accelerator to be able to skip most inode leaves as well would be a nice thing to do. I will think about that in the background. That is nearly identical to the ddsnap replication algorithm, and we also send the list over a pipe. Ddsnap does not deal with attributes, only volume blocks, otherwise I think we arrived at the same thing. Hmm. I have quite a few bits available in the inode table btree node pointers, which are currently only going to be free inode density hints. Something to think about. Very, very nice. This is a job for forward logging :-) There is a slight skew in your perception of the function of forward logging here, I think. Forward logs transactions are not long-lived disk objects, they just serve to batch together lots of little updates into a few full block "physical" updates that can be written to the disk in seek-optimized order. It still works well for this application. In short, we propagate dirty bits up the btree while updating the btree disk blocks only rarely. Instead, the node dirty bits are tucked into the commit block of the write transaction or namespace edit that caused the change, and are in turn propagated into the commit block of an upcoming physical rollup, eventually working their way up the on-disk btree image. But they are present in the cached btree image as soon as they are set, which is the structure consulted by the replication process. I hope that question is cleared up now. Tux3 also has an in-memory cache and an on-disk btree, and in addition, the in-memory cache is identical to some future version of the on-disk btree, which ought to be good for losing a lot of corner cases. Things are a little different in Linux. The VFS takes care of most of what you describe as your filesystem front end, and in fact, the VFS is capable of running as a filesystem entirely on its own, just by supplying a few stub methods to bypass the backing store (see ramfs). I think this is pretty much what you call your front end, though I probably missed some major functionality you provide there that the VFS does not directly provide. A filesystem backend on Linux would be the thing that implements the prepare_write/commit_write calls that come in from the one-size-fits all generic_file_buffered_write function that most filesystems use to implement write(2). These functions generally just call back into the block library, passing a filesystem-specific get_block callback which the library uses to assign physical disk sector addresses to buffers attached to the page. Ext3 opens a journal transaction in prepare_write and finishes it in commit_write. Not much of a transaction if you ask me. Tux3 wants to do things in a more direct way and in bigger chunks. Actually, all Tux3 needs to do with the prepare_write/commit_write calls that generic_file_buffered_write throws at it is remember which inodes where involved, because the inode keeps a list of dirty pages, the same ones that prepare_write/commit_write is trying to tell us about. When there are enough of them, on one of the commit_write calls Tux3 can go delving into its file index btree to find or create a place to store some. This will generate changes to the btree that must be logged logically in the commit block of the write transaction that is being set up, which now gets its sequence number so that the logical changes can be applied in the same order if replay is needed. At the point Tux3 is asked to generate some new logical changes, it may decide that it has already sent enough logical changes onto disk and now would be a good time to roll some of them up. The rolled up block images are sitting in the buffer cache where Tux3 has prudently locked them by incrementing the page use count. Tux3 now sets up some physical log transactions to write out the images. Before doing so, it copies each image to a new location in the buffer cache and modifies pointers in the parent images to point at the new images (possibly having to read in the parents first). This generates new logical changes (the changes to the parent nodes and changes to the free tree to allocate the new buffer cache positions) which will be logically logged in the commit blocks of the physical transaction about to be committed. When the rollup transactions have been sent to disk, Tux3 can continue allocating disk space etc by updating the new images of the disk blocks and setting up new transactions, but no transactions that depend on the rolled up state of the disk blocks can be allowed to complete until the rollup transactions have completed, which Tux3 learns of by counting the interrupt mode ->endio calls that come back from the bio transfers. Note: no other filesystem on Linux current works this way. They all pretty much rely on the block IO library which implements the fairly goofy one-block-at-a-time transfer regimen. The idea of setting up bio transfers directly from the filesystem is unfortunately something new. We should have been doing this for a long time, because the interface is quite elegant, and actually, that is just what the block IO library is doing many levels down below. It is time to strip away some levels and just do what needs to be done in the way it ought to be done. Notice how recursive the whole idea is. Logical transactions cause physical transactions which cause logical transactions etc. I am taking it on faith that the recursion terminates properly and does not get into cycles. Over time I will give concrete reasons why I think it all just works. Digression done, back to the Hammer algorithms... Hmm, and I swear I did not write the above before reading this paragraph. Many identical ideas there. That is pretty close to the mark, but there are UNDO records in Tux3, there are only logical and physical REDO records. The forward log is supposed to try to avoid syncing the beginning of the chain to a known location as much as possible, which it does by having at least one transaction in its pipeline that has not been committed to disk yet and is part of an existing forward chain. The chain is extended by allocating a location for the commit block of a new transaction being formed, storing that location in the commit block of the existing transaction, then committing the existing transaction to disk. I think you will like it even more when you realize that updating the filesystem header is not required for most transactions. A versioned extent: struct extent { unsigned version:10, count:6, block:24; }; Directory format to index the extent within a leaf: struct entry { unsigned loglo:24, offset:8 }; struct group { unsigned loghi:24, count:8 }; Leaf layout (abridged): struct leaf { unsigned groups; struct group[]; struct entry[]; struct elements[] }; Extents are divided into groups, all of which have the same upper 24 bits of logical address. The 8 bit offset gives the position of the first extent with that logical offset, measured in 8 byte units from the base of the extent group. The difference between two successive offset bytes is the number of extents with the same logical offset. The first offset byte gives the total size of the group, and would otherwise be always be zero and that would be a waste. The base of the ith group is determined by totalling the zeroth offset bytes. The 48 bit logical address is split between the 32 bit dictionary and the 32 bit group entry. This format gives the ability to binsearch on both entries and groups. The number of groups per leaf should be just one or two at deep leaves where considerable splitting of the logical address has already occurred, so the overhead of the dictionary is just 32 bits per extent roughly, giving a total overhead of 12 bytes per extent. Shallow btrees will approach 16 bytes per extent, but the tree is shallow anyway so it does not matter. The shallowest tree consists of just a few extents recorded as an attribute in the inode btree. I introduce the rule that for such an extent attribute, the extents belonging to each version are presumed to be logically contiguous, so no directory is needed and the extent overhead drops to 8 bytes in this common case. Actually, most files should just consist of a single extent. That could be a new attribute form, a size/extent attribute, with the mtime assumed to be the same as the ctime, which I think would cover a majority of files on the system I am using right now. See, I am really obsessive when it comes to saving bits. None of these compression hacks costs a lot of code or has a big chance of hiding a bug, because the extra boundary conditions are strictly local. We shall see. Factors working in my favor are: * Snapshots are explicit, so you have to create lots of them plus heavily update the data in order to make the versioned data bushy. * The linux page cache not only caches data but provides an efficient access path to logical blocks that avoids having to consult the filesystem metadata repeatedly under heavy traffic. * Compact metadata entities including versioned attributes and extents (pointers for now) * The kind of load that people put on a Netapp is a trivially small number of versions for Tux3. Btree, which I call the file index btree. Immediate data is allowed in an inode as a versioned attribute. It did not occur to me to leave them in the logical log, but yes that is a nice optimization. They will only stay there for a while, until a rollup is triggered, moving them into the inode table proper. It gets a little tricky to figure out whether the amortization due to the logical logging step is worth the extra write. There is a crossover somewhere, maybe around a hundred bytes or so, ideal for symlinks and acls. For rotating media, true. There is also flash, and Ramback... I found that from my HTree experience. I found that pretty much the entire cost of the bad old linear dirops was CPU and I was able to show a ridiculous speedup by cutting the cost down to O(log511(ops)) from O(ops^2). OK, you convinced me. I already have btree cursors (see ddsnap path[]) that are so far only used for per operation btree access and editing. So these will eventually become cache objects. Do the variable sized page hack :-] Or alternatively look at the XFS buffer cache shim layer, which Christoph manhandled into kernel after years of XFS not being accepted into mainline because of it. Let me see, for Tux3 that is 64 bytes for the inode part including a tiny amount of data, plus 32 bytes or so for the dirent part, say 100 bytes. This is my reward for being an obsessive bit mizer. With pointers it is about a megabyte per gigabyte for Tux3 too (and Ext2 and Ext3) so going to extents is not optional. I plan a limit of 64 blocks per extent, cutting the metadata overhead down to 16K per gigabyte, which is hard to complain about. Larger extents in the file index would not buy much and would mess up the nice, 8 byte one-size-fits-all versioned extent design. Waiting for me to prove the idea seems reasonable. That is nice, but since I want to run without a pruner I had better stick with the logical logging plan. The link count of the inode will also be committed as zero, or rather, the link attribute will be entirely missing at that point, which allows for a cross check. I was thinking more about the operation of taking somebody's home directory and turning it into a separate volume, for whatever reason. Tux3 snapshots (versions) have that write/rollback ability, and also have it for snapshots of snapshots, just in case that was not clear. Yes, it sounds nice. The lookup path is actually quite simple, I am far enough in to know that. The versioning algorithms are the most complex piece from an algorithmic point of view, and as you can see from the sample implementation, they ha
| david | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Greg Kroah-Hartman | [PATCH 004/196] Chinese: add translation of SubmittingPatches |
| Paul Jackson | Re: cpuset-remove-sched-domain-hooks-from-cpusets |
| Rafael J. Wysocki | [Bug #11210] libata badness |
| David Miller | Re: [GIT]: Networking |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Andrew Morton | Re: [BUG] New Kernel Bugs |
git: | |
