Good day storage mavens, Every now and then down in the unexciting and arcane underbelling of storage research one encounters some real computer science. Today I will outline one such encounter, which I expect to result in significant improvements to the capability and performance of our ddsnap snapshotting block device, and most probably has benefits in other areas as well, such as snapshotting filesystems. I apologize in advance for any forward references, as I wish to get straight to the meat of this. =46or definitions of various domain specific terms, please see the terminology section at the end. A snapshot of a disk volume may be represented as a list of exceptions to an origin volume, where each exception points to a logical chunk of data in a "snapshot store", which holds the data for the snapshot at a particular logical address. For any logical address for which no exception is present, the data resides on the origin volume. Thus, the entire set of exceptions for a snapshot can be thought of as a list of places where the snapshot differs from the origin. The "versioned pointer" method is a new way of representing exceptions for multiple simultaneous snapshots in a surprisingly compact form. The idea was inspired by a proposal from Steve Vandebogart (interning at Google) to represent exception sharing for a series of volume snapshots using a snapshot sequence number in place of a fixed size chunk sharing bitmap, as is the existing practice in ddsnap. The sequence number would be fewer bits than the bitmap and could be packed into unused bits of a physical block pointer, which would shrink our snapshot metadata significantly. However, when the sequence eventually wraps a full metadata edit would be needed to relabel stored exceptions, and so some snapshot creates would take much longer than others. Users generally do not like to wait around while snapshots are created. Versioned Pointers This issue was addressed by introducing a one to one mapping between snapshots and arbitrary version numbers in place of sequence numbers (and in the process notched up yet another example of a problem solved by adding a new layer of indirection). The idea of sequenced pointers thus begat version labels, which begat the notion of "versioned pointers", the subject of this post. It occurred to me that a pointer version label is similar to a revision control version id, and I proceeded to investigate the question of whether revision control techniques could be applied to volume snapshots. I had experimented with a system where each piece of text is labeled by the version in which it first appears, and by the version in which it disappears (the "birth" and "death" labels, essentially a binary weave). By knowing the hierarchical relationship between versions, the "version tree", it is possible to determine whether any given piece of labeled text belongs to a given version. Volume versioning differs from text versioning in that data chunks do not appear and disappear, but only change. Each change of chunk data can be viewed as a disappearance followed by an appearance, so only a single label needs to be associated with each change. The other label is implied by the presence of a change labeled by some descendent in the version tree. It became clear that a single version label for each rewrite of a given volume chunk is enough to determine the set of versions to which the written data belongs. Version Inheritance If a given version does not have an exception of its own for a particular logical address, then it inherits an exception from its parent, which in turn inherits from its parent if it does not have its own, and so on up to the root of the version tree. The root implicitly inherits all the chunks of the origin volume. Thus, a snapshot of the origin volume is just a new root of the version tree, with the old root as its child. Similarly, a snapshot of a snapshot is a new child of some version, and inherits all the exception of its parent, exactly as required. Since we have not so far had an efficient way of creating snapshots of snapshots in ddsnap, the possibly of being able to do so merely by altering a version tree seemed very interesting. Thus encouraged, I set out to determine whether a stable representation is possible in the sense that no full metadata edit would ever be required to maintain this representation except when a version is deleted. If this were the case then this new technique would be well suited to volume snapshotting, where a large amount of durable metadata has to be maintained and updated with good locality. Examples Here is an example version tree with version nodes labeled in the order created: =2E `-- C '1003' =A0 =A0 =A0`-- B '1002' =A0 =A0 =A0 =A0 =A0|-- A '1001' =A0 =A0 =A0 =A0 =A0`-- D =A0 =A0 =A0 =A0 =A0 =A0 =A0|-- E '1005' =A0 =A0 =A0 =A0 =A0 =A0 =A0`-- F =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0|-- G '1007' =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0`-- H '1006' Versions A, B and C are snapshots of the origin. =A0(They appear inverted in the tree because new origin snapshots are added at the root.) =A0All other versions represent snapshots of snapshots. Most of the versions have snapshot tags, in quotes. Those that do not are inaccessible ghost versions, explained below. Given the exception list [[A, p1] [B, p2]] where p1 and p2 are physical chunks, and omitting the snapshot tags, we have: =2E `-- C =A0 =A0 =A0`-- B =3D> p2 =A0 =A0 =A0 =A0 =A0|-- A =3D> p1 =A0 =A0 =A0 =A0 =A0`-- D =3D> p2 =A0 =A0 =A0 =A0 =A0 =A0 =A0|-- E =3D> p2 =A0 =A0 =A0 =A0 =A0 =A0 =A0`-- F =3D> p2 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0|-- G =3D> p2 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0`-- H =3D> p2 Version A is the exclusive owner of chunk p1 while all other versions except C inherit p2. =A0C has no exception at this logical address, therefore inherits a chunk of the origin. Another way to represent this is to overlay the chunks of the exception list on their respective versions: =2E `-- C =A0 =A0 =A0`-- B [p2] =A0 =A0 =A0 =A0 =A0|-- A [p1] =A0 =A0 =A0 =A0 =A0`-- D =A0 =A0 =A0 =A0 =A0 =A0 =A0|-- E =A0 =A0 =A0 =A0 =A0 =A0 =A0`-- F =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0|-- G =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0`-- H This representation shows both the global version tree and an exception list for a particular logical address. Writable Snapshots It is highly desirable that snapshots be writeable. As an example of why one might wish to do this, consider a virtualized server farm where each server has its own, slightly different image of the root volume. For each virtual server instance, a snapshot of a generic root volume is taken and customized for or altered at runtime by one of the server instances. (This also constitutes a use case for snapshots of snapshots, to avoid a clumsy revert of the origin volume for each new server instance.) Writeable snapshots are problematic in terms of inheritance: if a pointer for a newly allocated snapshot store chunk pointer is labeled with a given version then it may be inherited by other children of the same version, violating the principle of snapshot isolation (a write to a snapshot may not change any other snapshot). My solution is to generate a new, implicit version as a child of the written version, to which the new versioned pointer can be added without affecting any other version. Ghost Versions Adding a new layer of indirection was the method of choice once again: each snapshot is known externally by a numeric "tag" which is the handle for all operations on the snapshot, such as creating a virtual block device for it or deleting it. When a snapshot mapping to a version having one or more children is written for the first time, a new version is generated as above and its tag is reassigned to the new version. The original version loses its tag and cannot thereafter be accessed externally, becoming a "ghost version". A ghost version exists only to allow its children to continue to inherit any data written to it before it had any children. Thinking back on it, this was a fairly radical idea because it was far from clear that the implicitly snapshot strategy would not rapidly exhaust the limited supply of version numbers. Remarkably, it turned out to be the case that only a limited number of ghost versions could ever be created, that limit being one less than the number of externally known snapshots. Unfortunately, the proof of this will not fit in the margin of this email. An essential property of a ghost version is that, because it is not externally accessible and thus cannot be read or written, it need not be a consistent point in time version of a volume. Therefore its exceptions can (and must) be freely cannibalized by other versions as the geometry of the version tree changes and exception lists are modified over time. To illustrate, given a write to a snapshot with tag '1001' mapping to version A, with one child snapshot: =2E `-- A '1001' =A0 =A0 =A0`-- B '1002' A version C is implicitly created to hold the new exception [C, p1], which could not have been added to version A because it would then have been inherited by version B, violating the isolation of snapshot 1002: =2E `-- A =A0 =A0 |-- B '1002' =A0 =A0 =A0`-- C [p1] '1001' Version A has become a ghost. Version Deletion Problem When a snapshot with exactly one child is deleted, all the exceptions it owns can simply be relabelled to the child version, and the child version replaces the deleted version in the version tree. This does not violate snapshot isolation because the chunks in question were previously inherited by the child anyway. When a version with more than one child is deleted, it is not obvious how to relabel exceptions belonging to the deleted version. It would be possible to generate duplicate exceptions sharing the original chunk pointers, but this would raise the specter of metadata that can expand during a delete. When the snapshot store is nearly full the delete could fail for lack of space. Not only would this surprise the user, but in the case of automatic deletion to recover space to service a write to a snapshot, the system could deadlock. There is no such thing as an elegant workaround for an expand-in-delete problem. Ghost Versions Redux The solution I adopted is to leave the version associated with a deleted snapshot in the version tree, along with any inherited exceptions. The version's tag is deleted, creating what I initially called a "zombie" version. Later we discovered that zombie and ghost versions are in fact the same, and in particular, share the property that they cannot proliferate without bound. (This was just one of many close calls for the versioned pointer method, where subtle logic came to the rescue in the face of some seemingly insurmountable problem.) A ghost version can always be deleted when its child count drops to one at the time one of its remaining two children was deleted. The single remaining child is promoted to take its place in the veresion tree, and all exceptions belonging to the ghost are relabeled to the child as in the case of explicit deletion of a snapshot with a single child. This fact is one link in the long chain of reasoning required to show that ghost can never outnumber explicitly created snapshots. As a corollary, each ghost version requires at least two children to force its continued existence. (Before leaping to the conclusion that this shows there are always more explicitly snapshots than ghost versions, consider that both children may themselves be ghosts.) Exclusive Exceptions An exclusive exception is one that is not inherited by any child, either because the owner has no children, or each of its heirs has an exception for the same chunk address, or every version that inherits the exception is a ghost. An exclusive exception's data chunk can always be modified without affecting any other snapshot. In particular, whenever a logical chunk of a given snapshot is written repeatedly with no intervening snapshot creates, all but the first write are guaranteed to be to exclusive exceptions. Orphan Exceptions An orphan exception is an exclusive exception that belongs to a ghost. Because it cannot be read or written directly via a snapshot tag or indirectly by being inherited, it is completely inaccessible and will only waste space in the snapshot store if allowed to exist. Orphan exceptions may be created by writes to snapshots or deletes, and the respective algorithms must detect and delete them immediately to avoid leakage. Potentially long and complex chains of inheritance can be involved, so this requirement gives rise to some particularly subtle rules. For example: "If a target has no children and no exception then removing it or replacing its ghost parent with no exception by a sibling of the target with no heirs reduces heirs of the parent. If heirs are reduced, a search for a ghost ancestor with an uninherited exception must be performed." Reading from a Snapshot Reading from a snapshot at a given logical chunk address requires determining which exceptions are present in the exception list for that logical address, and are labeled by versions lying between the target version and the root of the snapshot tree. If any such exceptions are found, then the exception furthest from the root owns the snapshot data of interest. Otherwise the data for that logical address resides on the origin volume. Writing to a Snapshot After a write to particular logical chunk of a snapshot an exclusive exception for that chunk of that snapshot will always exist, containing the written data. Before the write, the snapshot may have had no exception but inherited an exception from some other snapshot or inherited a chunk from the origin, or it may have had an exception that is inherited by one or more of its descendants, or it may have already had an exclusive exception. In all cases except the last, an exception is added to the snapshot at that logical address. If an exception is inherited from a ghost and the written snapshot is the sole heir of the exception, then the exception will be relabeled, otherwise a new exception will be created, including allocating a snapshot store chunk to hold the exception data. Writing to the Origin If an origin chunk is inherited by any snapshot, then the data about to be overwritten must be copied to the snapshot store before the new data is written. If the root of the snapshot already has an exception, then the origin chunk is clearly not inherited and nothing needs to be done. Otherwise a new exception for the root version is created and the data to be preserved from the origin is copied to the associated chunk. Deleting a Snapshot Snapshot deletion is by far the most complex aspect of the versioned pointers technique. Various rules must be enforced, for example, that no ghost version may exist with less than two children. If the deleted snapshot has one child then the child will be promoted to be a child of the deleted snapshot's parent. If it has two or more children then the deleted snapshot will become a ghost. If it has no children and its parent is a ghost with two children, then the parent will be deleted and the sibling of the deleted snapshot will be promoted to take the place of the parent. Besides adjusting the version tree, each exception for the deleted version (or versions if the parent is also deleted) must be either removed from the snapshot metadata if it is not inherited by any version, or relabeled to one of its children if it is. Any exceptions that become orphans as a result of no longer being inherited by the deleted snapshot must be detected and removed. Implementation The attached test program implements all six basic snapshot operations: 1) Create snapshot of origin 2) Create snapshot of snapshot 3) Delete snapshot 4) Write to origin 5) Write to snapshot 6) Read from snapshot None of the algorithms are worse than O(n) where n is the number of versions in the version tree. Most are O(e) where e is the number of exceptions in an exception list. This level of performance was largely achieved through pre-computation of lookup tables to accelerate such operations as determining whether a given version lies on the path from some other version to the root of the version tree. The snapshot delete in particular started as an O(n^3) algorithm before being gradually improved to O(e) using mapping tables and a multipass approach. It is thought that all the algorithms can eventually be improved to O(log n) worst case performance. Meanwhile, since n is limited to a fairly small number by the number of bits available in a version label, real world performance appears entirely satisfactory. Compared to our current representation of snapshot exception data, fitting more data in cache is likely to have a bigger effect than the slightly more expensive base operations. Extents The versioned pointer technique is compatible with extent-based data representations, much more so than the snapshot exception representation it is expected to replace, which ties snapshots together with a sharing bitmap that would be complex to represent in an extent form, as the extents are likely to be different for each snapshot. On the other hand, each exception represented by a versioned pointer simply needs a count field added to become an extent. Using 64 bit exceptions as we do, there is enough room in the pointer for 48 bits of logical address, addressing an even exabyte of snapshot store, 10 bits of version label allowing 512 user visible snapshots, and a 6 bit extent count, allowing extents up to 256K assuming 4K blocks. Allowing for other overheads, this gives a data to metadata ratio of about 15,000 to one, beyond which there is not a great deal of additional benefit to be obtained from extents. Extension of the current algorithms to handle extents is in progress. Application to filesystems The versioned pointer technique is by no means limited to representing volume data. It also has promise as a means of implementing filesystem snapshots. Compared to the technique used by WAFL or ZFS, there is no recursive copying of tree-structured metadata. Compared to Btrfs, there are no reference counts. A possible deficiency is the bias towards representing all version information for a given logical address together at the same location in the metadata. If there is a lot of version metadata and only a single snapshot is being accessed, a larger amount of metadata may have to be read. Extending the method to handle extents would mitigate this. If applied to filesystems, it would be feasible to independently snapshot each file and each directory. =46uzzing Test The attached test program implements a "fuzzing test" to verify empirically the correct implementation of the versioned pointers operations. Without loss of generality, only a single exception list is tested, because each exception list is self contained and independent from all other exception associated with other logical addresses. Each iteration of the fuzz test randomly selects one of the first five operations and performs it on a random target, then reads every snapshot to verify that all contain the expected data. Various consistency tests are performed, for example: * All exception labels are valid * No deleted version labels an exception * No multiple exceptions with same label in same list * No ghosts have orphan exceptions * No ghost has less than two children * No cycles in the version tree This fuzzing test has run successfully to ten million iterations. This does not prove that the algorithms are correct, but it is encouraging. Along the way, seemingly endless bugs were discovered especially in the area of ghost exception detection. Some of the invariants described in this note were discovered in response to bugs detected by the fuzzing test, which is not to say that empirical observation is a substitute for logical analysis, but that the underlying logic was most certainly discovered faster than it could have been by logical analysis alone. Conclusion The versioned pointer method is a novel technique for representing versioned disk data, which promises the following benefit to the ddsnap snapshotting block device project: =A0 =A01) Maximum number of snapshots increases =A0 =A02) Metadata shrinks by up to 50% =A0 =A03) Supports instant creation of snapshots of snapshots =A0 =A04) Easier to move to extents for additional compactness It is also possible that the method may prove useful for filesystem snapshots. A prototype implementation exists, demonstrating the efficacy of the technique and empirically verifying correctness. Terminology Snapshot: =A0A volume version tagged with a unique id by which it can be accessed and operated on externally. Snapshot tag: =A0An externally visible 32 bit integer specified by an application programs at the time a snapshot is created. =A0The snapshot tag is used to create a virtual snapshot volume through which the snapshot can be read and written. Snapshot chunk: =A0A chunk in the snapshot store holding data that was either copied from the origin due to a write to an origin logical volume or written to a snapshot logical volume. Origin chunk: =A0A chunk of data on the origin volume that is implicitly shared at the same address by any version that does not have alternate data at that address. Version label: =A0A unique id carried by each node in the version tree. Version tree: =A0The root of the version tree is the most recent snapshot of the origin. =A0Each new version of the origin becomes the new root of the version tree and the parent of the old root. =A0Each new version of a version becomes a child of the existing version. Chunk inheritance: =A0When a new version is created it inherits every chunk of its parent. =A0The root of the version tree inherits every chunk of the origin volume. =A0Care must be taken never to write to an inherited chunk, otherwise all versions inheriting the chunk will be altered in violation of the principle of isolation between versions (the I in ACID). Exception list: =A0A list of exceptions for a given logical chunk address defining which physical chunks belong to which versions. =A0Each exception is said to be "at" that logical address. Exception: =A0A pair [V, p] that appears in an exception list to specify that the physical chunk p is inherited by all nodes in the version subtree rooted at the node labelled V, and bounded by nodes labelled by other exceptions in the same list. =A0Each physical chunk is owned by exactly one exception. =A0An exception labeled by a given version is said to be "at" that version. We may say read or write "to an exception" which really means "to the physical chunk owned by the exception". Unique vs shared chunks: =A0A physical chunk is said to be unique if it is not inherited and shared otherwise. =A0An origin chunk at a given logical address is unique if and only if there is an exception at that address labeled by the root version. =A0A chunk of the origin or a version can be written to without affecting any other version if and only if it is unique. Regards, Daniel
