linSubscribed now, everything should be OK.
On Friday 25 July 2008 19:02, Matthew Dillon wrote:
Yes. Writes tend to be highly parallel in Linux because they are
mainly driven by the VMM attempting to clean cache dirtied by active
writers, who generally do not wait for syncing. So this will work
really well for buffered IO, which is most of what goes on in Linux.
I have not thought much about how well this works for O_SYNC or
O_DIRECT from a single process. I might have to do it slightly
differently to avoid performance artifacts there, for example, guess
where the next few direct writes are going to land based on where the
most recent ones did and commit a block that says "the next few commit
blocks will be found here, and here, and here...".
When a forward commit block is actually written it contains a sequence
number and a hash of its transaction in order to know whether the
commit block write ever completed. This introduces a risk that data
overwritten by the commit block might contain the same hash and same
sequence number in the same position, causing corruption on replay.
The chance of this happening is inversely related to the size of the
hash times the chance of colliding with the same sequence number in
random data times the chance of of rebooting randomly. So the risk can
be set arbitrarily small by selecting the size of the hash, and using
a good hash. (Incidentally, TEA was not very good when I tested it in
the course of developing dx_hack_hash for HTree.)
Note: I am well aware that a debate will ensue about whether there is
any such thing as "acceptable risk" in relying on a hash to know if a
commit has completed. This occurred in the case of Graydon Hoare's
Monotone version control system and continues to this day, but the fact
is, the cool modern version control systems such as Git and Mercurial
now rely very successfully on such hashes. Nonetheless, the debate
will keep going, possibly as FUD from parties who just plain want to
use some other filesystem for their own reasons. To quell that
definitively I need a mount option that avoids all such commit risk,
perhaps by providing modest sized journal areas salted throughout the
volume whose sole purpose is to record log commit blocks, which then
are not forward. Only slightly less efficient than forward logging
and better than journalling, which has to seek far away to the journal
and has to provide journal space for the biggest possible journal
transaction as opposed to the most commit blocks needed for the largest
possible VFS transaction (probably one).
> If the forward log entries are not (all) cached in-memory that would mean
Actually, the btree node images are kept fully up to date in the page
cache which is the only way the high level filesystem code accesses
them. They do not reflect exactly what is on the disk, but they do
reflect exactly what would be on disk if all the logs were fully
rolled up ("replay").
The only operations on forward logs are:
1) Write them
2) Roll them up into the target objects
3) Wait on rollup completion
The rollup operation is also used for replay after a crash.
A forward log that carries the edits to some dirty cache block pins
that dirty block in memory and must be rolled up into a physical log
before the cache block can be flushed to disk. Fortunately, such a
rollup requires only a predictable amount of memory: space to load
enough of the free tree to allocate space for the rollup log, enough
available cache blocks to probe the btrees involved, and a few
cache blocks to set up the physical log transaction. It is the
responsibility of the transaction manager to ensure that sufficient
memory to complete the transaction is available before initiating it,
otherwise deadlock may occur in the block writeout path. (This
requirement is the same as for any other transaction scheme).
One traditional nasty case that becomes really nice with logical
forward logging is truncate of a gigantic file. We just need to
commit a logical update like ['resize', inum, 0] then the inode data
truncate can proceed as convenient. Another is orphan inode handling
where an open file has been completely unlinked, in which case we
log the logical change ['free', inum] then proceed with the actual
delete when the file is closed or when the log is replayed after a
surprise reboot.
Logical log replay is not idempotent, so special care has to be taken
on replay to ensure that specified changes have not already been
applied to the target object. I choose not to go the traditional route
of providing special case tests for "already applied" which get really
ugly or unreliable when there are lot of stacked changes. Instead I
introduce the rule that a logical change can only be applied to a known
good version of the target object, which promise is fullfilled via the
physical logging layer.
For example, on replay, if we have some btree index on disk for which
some logical changes are outstanding then first we find the most recent
physically logged version of the index block and read it into cache,
then apply the logical changes to it there. Where interdependencies
exist between updates, for example the free tree should be updated to
reflect a block freed by merging two btree nodes, the entire collection
of logical and physical changes has to be replayed in topologically
sorted order, the details of which I have not thought much about other
than to notice it is always possible.
When replay is completed, we have a number of dirty cache blocks which
are identical to the unflushed cache blocks at the time of a crash,
and we have not yet flushed any of those to disk. (I suppose this gets
interesting and probably does need some paranoid flushing logic in
replay to handle the bizarre case where a user replays on a smaller
memory configuration than they crashed on.) The thing is, replay
returns the filesystem to the logical state it was in when the crash
happened. This is a detail that journalling filesystem authors tend
to overlook: actually flushing out the result of the replay is
pointless and only obscures the essential logic. Think about a crash
during replay, what good has the flush done?
> And even though you wouldn't have to group transactions into larger
I do not see why this example cannot be logically logged in pieces:
['new', inum_a, mode etc] ['link', inum_parent, inum_a, "a"]
['new', inum_b, mode etc] ['link' inum_a, inum_b, "b"]
['new', inum_c, mode etc] ['link' inum_a_b, inum_c, "c"]
['new', inum_x, mode etc] ['link' inum_a_b, inum_x, "x"]
['new', inum_d, mode etc] ['link' inum_a_b_c, inum_d, "d"]
Logical updates on one line are in the same logical commit. Logical
allocations of blocks to record the possibly split btree leaves and new
allocations omitted for clarity. The omitted logical updates are
bounded by the depth of the btrees. To keep things simple, the logical
log format should be such that it is impossible to overflow one commit
block with the updates required to represent a single vfs level
transaction.
I suspect there may be some terminology skew re the term "transaction".
Tux3 understands this as "VFS transaction", which does not include
trying to make an entire write (2) call atomic for example, but only
such things as allocate+write for a single page cache page or
allocate+link for a sys_link call. Fsync(2) is not a transaction but
a barrier that Tux3 is free to realize via multiple VFS transactions,
which the Linux VFS now takes care of pretty well now after many years
of patching up the logic.
> At some point you have to be able to commit so the whole mess
Indeed. I think I have identified a number of techniques for stripping
away much of that complexity. For example, the governing parent update
can be considered complete as soon as the logical 'link' log commit has
completed.
> Last question here: Your are forward-logging high level operations.
Correct! That is why there are two levels of logging: logical and
physical. The physical logging level takes care of updating the cache
images of disk blocks to match what the logical logging level expects
before it can apply its changes. These two logging levels are
interleaved: where a logical change requires splitting a btree block,
the resulting blocks are logged logically, but linked into the parent
btree block using a logical update stored in the commit block of the
physical log transaction. How cool is that? The physical log
transactions are not just throwaway things, they are the actual new
data. Only the commit block is discarded, which I suppose will leave
a lot of one block holes around the volume, but then I do not have to
require that the commit block be immediately adjacent to the body of
the transaction, which will allow me to get good value out of such
holes. On modern rotating media, strictly linear transfers are not
that much more efficient than several discontiguous transfers that all
land relatively close to each other.
> So your crash recovery code will have to handle
Tux3 does not use undo logging, only redo, so a transaction is complete
as soon as there is enough information on durable media to replay the
redo records.
> And there will have to be a tie-in between the meta-data commits and
Yes. Each new commit records the sequence number of the oldest commit
that should be replayed. So the train lays down track in front of
itself and pulls it up again when the caboose passes. For now, there
is just one linear sequence of commits, though I could see elaborating
that to support efficient clustering.
One messy detail: each forward log transaction is written into free
space wherever physically convenient, but we need to be sure that that
free space is not allocated for data until log rollup has proceeded
past that transaction. One way to do this is to make a special check
against the list of log transactions in flight at the point where
extent allocation thinks it has discovered a suitable free block, which
is the way ddsnap currently implements the idea. I am not sure whether
I am going to stick with that method for Tux3 or just update the disk
image of the free tree to include the log transaction blocks and
somehow avoid logging those particular free tree changes to disk. Hmm,
a choice of two ugly but workable methods, but thankfully neither
affects the disk image.
> :Once I get down to the leaf level I binary search on logical address in
Yes, that is my expectation. I think everything will perform fine up
to a few hundred snapshots without special optimization, which is way
beyond current expectations. Most users only recently upgraded to
journalling filesystems let alone having any exposure to snapshots
at all.
> /var/log/messages
Right, when writing the file takes longer than the snapshot interval.
> Even with a direct bypass for data blocks (but not their meta-data,
For Tux3 with an hourly snapshot schedule that will only be four or
five versions of the [mtime, size] attribute to reflect the 4.7 hours
write time or so, and just one version of each block pointer.
> :The penultimate inode table index block tells me how many blocks a
The inode table block is split at a boundary between inodes.
An "inode" is broken up into attribute groups arranged so that it makes
sense to update or version all the members of an attribute group
together. The "standard" attribute group consists of mode, uid, gid,
ctime and version, which adds up to 16 bytes. The smallest empty inode
(touch foo) is 40 bytes and a file with "foo" in it as immediate data
is 64 bytes. This has a standard attribute, a link count attribute, a
size/mtime attribute, and an immediate data attribute, all versioned.
An inode with heavily versioned attributes might overflow into the next
btree leaf as described elsewhere.
Free inodes lying between active inodes in the same leaf use two bytes
each, a consequence of the inode leaf directory, which has a table
at the top of the leaf of two byte pointers giving the offset of each
inode in the leaf, stored backwards and growing down towards the inode
attributes. (This is a slight evolution of the ddsnap leaf format.)
> I'm sure this ties into the forward-log but even with the best algorithms
Split the inode in memory, allocating a new buffer; forward log the two
new pieces physically out to disk with a logical record in the commit
block recording where the two new pointers are to be inserted without
actually inserting them until a logical rollup episode is triggered.
> :> I couldn't get away from having a delete_tid (the 'death version
I can think of three common cases of files that get a lot of historical
modifications:
* Append log
- The first iteration in each new version generates a new size
attribute
* Truncate/rewrite
- The first iteration in each new version generates a new size
attribute and either a new file index root or a new immediate
data attribute
* Database
- The first file change in each new version generates a new
versioned pointer at the corresponding logical address in the
file btree and a new size attribute (just because mtime is
bundled together with the size in one attribute group).
So the only real proliferation is the size/mtime attributes, which gets
back to what I was thinking about providing quick access for the
"current" version (whatever that means).
> :I was originally planning to keep all versions of a truncate/rewrite
Yes, the proposal is to treat truncation to exactly zero as a special
case. It is a tiny amount of extra code for what is arguably the most
common file rewrite case.
> With HAMMER I chose to keep everything in one B-Tree, whether historical
Essentially I chose the same strategy, except that I have the file
trees descending from the inode table instead of stretching out to the
side. I think this gives a more compact tree overall, and since I am
using just one generic set of btree operations to handle these two
variant btrees, additional code complexity is minimal.
> :> Both numbers are then needed to
There is a versioned link count attribute in the inode. When a link
attribute goes to zero for a particular version it is removed from the
inode. When there are no more link attributes left, that means no
directory block references the inode for any version and the inode may
be reused.
Note: one slightly bizarre property of this scheme is that an inode
can be reused in any version in which its link count is zero, and the
data blocks referenced by different versions in the same file index
can be completely unrelated. I doubt there is a use for that.
> Directories in HAMMER are just B-Tree elements. One element per
Ext2 dirents are 8 bytes + name + round up to 4 bytes, very tough to
beat that compactness. We have learned through bitter experience that
anything other than an Ext2/UFS style physically stable block of
dirents makes it difficult to support NFS telldir cookies accurately
because NFS vs gives us only a 31 bit cookie to work with, and that is
not enough to store a cursor for, say, a hash order directory
traversal. This is the main reason that I have decided to go back to
basics for the Tux3 directory format, PHTree, and make it physically
stable.
In the PHTree directory format lookups are also trivial: the directory
btree is keyed by a hash of the name, then each dirent block (typically
one) that has a name with that hash is searched linearly. Dirent block
pointer/hash pairs are at the btree leaves. A one million entry
directory has about 5,000 dirent blocks referenced by about 1000 btree
leaf blocks, in turn referenced by three btree index blocks (branching
factor of 511 and 75% fullness). These blocks all tend to end up in
the page cache for the directory file, so searching seldom references
the file index btree.
I did some back of the envelope calculations of the number of cache
lines that have to be hit for a lookup by Hammer with its fat btree
keys and lower branching factor, vs Tux3 with its high branching
factor, small keys, small pointers, extra btree level for dirent
offset stability and stupidly wasteful linear dirent search. I will
not bore the list with the details, but it is obvious that Tux3/PHTree
will pass Hammer in lookup speed for some size of directory because of
the higher btree fanout. PHTree starts from way behind courtesy of the
32 cache lines that have to be hit on average for the linear search,
amounting to more than half the CPU cost of performing a lookup in a
million element directory, so the crossover point is somewhere up in
the millions of entries.
Thus annoyed, I cast about for a better dirent leaf format than the
traditional Ext2 directory block, and found one after not too much
head scratching. I reuse the same leaf directory format as for inode
leaf blocks, but this time the internal offsets are sorted by lexical
name order instead of inode number order. The dirents become Ext2
dirents minus the record number format, and minus the padding to
four byte alignment, which does not do anything useful. Dirent inum
increases to 6 bytes, balanced by saving 1.5 pad bytes on average,
so the resulting structure stores about 2% fewer dirents than the
Ext2 format against a probable 50% reduction in CPU latency per lookup.
A fine trade indeed.
The new format is physically stable and suitable for binary searching.
When we need to manage space within the leaf for creating or deleting
an entry, a version of the leaf directory ordered by offset can be
created rapidly from the lexically sorted directory, which occupies at
most about six cache lines. The resulting small structure can be
cached to take advantage of the fact that most mass creates and deletes
keep hitting the same dirent block repeatedly, because Tux3 hands the
dirents to getdents in physical dirent order.
I concluded that both Hammer and PHTree are exceptionally fast at name
resolution. When I have the new dirent block format in place I expect
to really hammer Hammer. But then I do not expect Hammer to stand
still either ;-)
> Your current directory block method could also represent a problem for
A PHTree directory grows like an append-only file, a block at a time,
though every time a new entry is added, mtime has to change, so mtime
changes more often than size. They are grouped together in the same
attribute, so the distinction is moot. If the directory is modified at
least once after every snapshot (quite likely) then the snapshot
retention policy governs how many mtime/size attributes are stored in
the inode.
Say the limit is 1,024 snapshots, then 16K worth of those attributes
have to be stored, which once again encourages me to store the "current"
mtime specially where it can be retrieved quickly. It would also be a
good idea to store the file data attribute in front of the mtime/size
attributes so only stat has to worry about the bulky version info, not
lookup.
Note that there is no such thing as revving an entire inode, only bits
and pieces of it.
For truly massive number of versions, the plan is to split up the
version tree into subtrees, each having a thousand versions or so. For
each version subtree (hmm, subversion tree...) there is a separate
inode btree carrying only attributes and file data owned by members of
that subtree. At most log(subtrees) of those tables need to be
accessed by any versioned entity algorithm. (Note the terminology
shift from versioned pointers to versioned entities to reflect the fact
that the same algorithms work for both pointers and attributes.) I
think this approach scales roughly to infinity, or at least to the
point where log(versions) goes vertical which is essentially never. It
requires a bounded amount of implementation effort, which will probably
be deferred for months or years.
Incidentally, PHTree directories are pretty much expand-only, which
is required in order to maintain physical dirent stability. Compaction
is not hard, but it has to be an admin-triggered operation so it will
not collide with traversing the directory.
> :> Both numbers are
It is fine for diving into old snapshots. Lookup is always O(elements)
where an element is a versioned pointer or (later) extent, which are
very compact at 8 bytes each. Because the truncate/rewrite file update
case is so common, you will rarely see more than one version at any
given logical address in a file. A directory growing very slowly could
end up with about 200 different versions of each of its final few
blocks, one for each added entry. It takes on the order of a
microsecond to scan through a few hundred versioned pointers to find
the one that points at the version in which we are interested, and that
only happens the first time the directory block is read into the page
cache. After that, each reference to the block costs a couple of
lookups in a page cache radix tree.
The version lookup algorithm is not completely obvious: we scan through
the list of pointers looking for the version label that is nearest the
version being accessed on the path from that version to the root of the
version tree. This potentially expensive search is accelerated using a
bitmap table to know when a version is "on the path" and a pre-computed
ord value for each version, to know which of the versions present and
"on the path" for a given logical address is furthest from the root
version. This notion of furthest from the root on the path from a given
version to the root implements data inheritance, which is what yields
the compact representation of versioned data, and is also what
eliminates the need to explicitly represent the death version. You have
discovered this too, in a simpler form. I believe that once you get the
aha you will want to add versions of versions to your model.
> For informational purposes: HAMMER has one B-Tree, organized using
That is really sweet, provided that you are ok with the big keys. It
was smart to go with that regular structure, giving you tons of control
over everything and get the system up and running early, thus beating the
competition. I think you can iterate from there to compress things and
be even faster. Though if I were to presume to choose your priorities
for you, it would be to make the reblocking optional.
> The create_tid is the creation transaction id. Historical accesses are
So you have to search through a range of TIDs to find the operative one
for a particular version, just as I have to do with versioned pointers.
Now just throw in a bitmap like I use to take the version inheritance
into account and you have versioned pointers, which means snapshots of
snapshots and other nice things. You're welcome :-)
I think that versioned pointer lookup can be implemented in O(log(E))
where E is the number of versioned pointers (entities) at the same
logical address, which would put it on a par with btree indexing, but
more compact. I have sketched out an algorithm for this but I have not
yet tried to implement it.
> There is also a delete_tid which is used to filter out elements that
This is where I do not quite follow you. File contents are never
really deleted in subsequent versions, they are just replaced. To
truncate a file, just add or update size attribute for the target
version and recover any data blocks past the truncate point that
belongs only to the target version.
Extended attributes and dirents can be truly deleted. To delete an
extended attribute that is shared by some other version I will insert
a new versioned attribute that says "this attribute is not here" for
the current version, which will be inherited by its child versions.
Dirent deletion is handled by updating the dirent block, possibly
versioning the entire block. Should the entire block ever be deleted
by a directory repacking operation (never actually happens on the vast
majority of Linux systems) that will be handled as a file truncate.
I see that in Hammer, the name is deleted from the btree instead, so
fair enough, that is actual deletion of an object. But it could be
handled alternatively by adding a "this object is not here" object,
which might be enough to get rid of delete_tid entirely and just have
create_tid.
> :Fair enough. I have an entirely different approach to what you call
I plan to scan the whole btree initially, which is what we do in ddsnap
and it works fine. But by looking at the mtime attributes in the inode
I can see in which versions the file data was changed and thus not have
to scan most files. Eventually, building in some accelerator to be able
to skip most inode leaves as well would be a nice thing to do. I will
think about that in the background.
> HAMMER's mirroring basically works like this: The master has synchronized
That is nearly identical to the ddsnap replication algorithm, and we
also send the list over a pipe. Ddsnap does not deal with attributes,
only volume blocks, otherwise I think we arrived at the same thing.
> To avoid having to scan the entire B-Tree I perform an optimization
Hmm. I have quite a few bits available in the inode table btree node
pointers, which are currently only going to be free inode density
hints. Something to think about.
> This allows incremental mirroring, multiple slaves, and also allows
Very, very nice.
> The advantage is all the cool features I got by doing things that way.
This is a job for forward logging :-)
> You may want to consider something similar. I think using the
There is a slight skew in your perception of the function of forward
logging here, I think. Forward logs transactions are not long-lived
disk objects, they just serve to batch together lots of little updates
into a few full block "physical" updates that can be written to the
disk in seek-optimized order. It still works well for this application.
In short, we propagate dirty bits up the btree while updating the btree
disk blocks only rarely. Instead, the node dirty bits are tucked into
the commit block of the write transaction or namespace edit that caused
the change, and are in turn propagated into the commit block of an
upcoming physical rollup, eventually working their way up the on-disk
btree image. But they are present in the cached btree image as soon as
they are set, which is the structure consulted by the replication
process.
> :I intend to log insertions and deletions logically, which keeps each
I hope that question is cleared up now.
> With HAMMER I have an in-memory cache and the on-disk B-Tree. Just
Tux3 also has an in-memory cache and an on-disk btree, and in addition,
the in-memory cache is identical to some future version of the on-disk
btree, which ought to be good for losing a lot of corner cases.
> :That reminds me, I was concerned about the idea of UNDO records vs
Things are a little different in Linux. The VFS takes care of most of
what you describe as your filesystem front end, and in fact, the VFS is
capable of running as a filesystem entirely on its own, just by
supplying a few stub methods to bypass the backing store (see ramfs).
I think this is pretty much what you call your front end, though I
probably missed some major functionality you provide there that the
VFS does not directly provide.
A filesystem backend on Linux would be the thing that implements the
prepare_write/commit_write calls that come in from the one-size-fits all
generic_file_buffered_write function that most filesystems use to
implement write(2). These functions generally just call back into the
block library, passing a filesystem-specific get_block callback which
the library uses to assign physical disk sector addresses to buffers
attached to the page. Ext3 opens a journal transaction in prepare_write
and finishes it in commit_write. Not much of a transaction if you ask
me. Tux3 wants to do things in a more direct way and in bigger chunks.
Actually, all Tux3 needs to do with the prepare_write/commit_write
calls that generic_file_buffered_write throws at it is remember which
inodes where involved, because the inode keeps a list of dirty pages,
the same ones that prepare_write/commit_write is trying to tell us
about. When there are enough of them, on one of the commit_write calls
Tux3 can go delving into its file index btree to find or create a place
to store some. This will generate changes to the btree that must be
logged logically in the commit block of the write transaction that is
being set up, which now gets its sequence number so that the logical
changes can be applied in the same order if replay is needed.
At the point Tux3 is asked to generate some new logical changes, it may
decide that it has already sent enough logical changes onto disk and
now would be a good time to roll some of them up. The rolled up block
images are sitting in the buffer cache where Tux3 has prudently locked
them by incrementing the page use count. Tux3 now sets up some physical
log transactions to write out the images. Before doing so, it copies
each image to a new location in the buffer cache and modifies pointers
in the parent images to point at the new images (possibly having to read
in the parents first). This generates new logical changes (the changes
to the parent nodes and changes to the free tree to allocate the new
buffer cache positions) which will be logically logged in the commit
blocks of the physical transaction about to be committed.
When the rollup transactions have been sent to disk, Tux3 can continue
allocating disk space etc by updating the new images of the disk blocks
and setting up new transactions, but no transactions that depend on the
rolled up state of the disk blocks can be allowed to complete until the
rollup transactions have completed, which Tux3 learns of by counting the
interrupt mode ->endio calls that come back from the bio transfers.
Note: no other filesystem on Linux current works this way. They all
pretty much rely on the block IO library which implements the fairly
goofy one-block-at-a-time transfer regimen. The idea of setting up bio
transfers directly from the filesystem is unfortunately something new.
We should have been doing this for a long time, because the interface
is quite elegant, and actually, that is just what the block IO library
is doing many levels down below. It is time to strip away some levels
and just do what needs to be done in the way it ought to be done.
Notice how recursive the whole idea is. Logical transactions cause
physical transactions which cause logical transactions etc. I am
taking it on faith that the recursion terminates properly and does not
get into cycles. Over time I will give concrete reasons why I think it
all just works.
Digression done, back to the Hammer algorithms...
> When the backend decides to flush the cached meta-data ops it breaks
Hmm, and I swear I did not write the above before reading this paragraph.
Many identical ideas there.
> When the flush group has completed any remaining UNDO is flushed to the
That is pretty close to the mark, but there are UNDO records in Tux3,
there are only logical and physical REDO records.
The forward log is supposed to try to avoid syncing the beginning of
the chain to a known location as much as possible, which it does by
having at least one transaction in its pipeline that has not been
committed to disk yet and is part of an existing forward chain. The
chain is extended by allocating a location for the commit block of a
new transaction being formed, storing that location in the commit
block of the existing transaction, then committing the existing
transaction to disk.
> The forward-log approach is definitely more fine-grained, particularly
I think you will like it even more when you realize that updating the
filesystem header is not required for most transactions.
> :> HAMMER's B-Tree elements are probably huge compared to Tux3, and that's
A versioned extent:
struct extent { unsigned version:10, count:6, block:24; };
Directory format to index the extent within a leaf:
struct entry { unsigned loglo:24, offset:8 };
struct group { unsigned loghi:24, count:8 };
Leaf layout (abridged):
struct leaf { unsigned groups; struct group[]; struct entry[]; struct elements[] };
Extents are divided into groups, all of which have the same upper 24 bits
of logical address. The 8 bit offset gives the position of the first
extent with that logical offset, measured in 8 byte units from the base
of the extent group. The difference between two successive offset
bytes is the number of extents with the same logical offset. The first
offset byte gives the total size of the group, and would otherwise be
always be zero and that would be a waste. The base of the ith group is
determined by totalling the zeroth offset bytes.
The 48 bit logical address is split between the 32 bit dictionary and
the 32 bit group entry. This format gives the ability to binsearch on
both entries and groups. The number of groups per leaf should be just
one or two at deep leaves where considerable splitting of the logical
address has already occurred, so the overhead of the dictionary is just
32 bits per extent roughly, giving a total overhead of 12 bytes per
extent. Shallow btrees will approach 16 bytes per extent, but the tree
is shallow anyway so it does not matter. The shallowest tree consists
of just a few extents recorded as an attribute in the inode btree. I
introduce the rule that for such an extent attribute, the extents
belonging to each version are presumed to be logically contiguous, so
no directory is needed and the extent overhead drops to 8 bytes in this
common case. Actually, most files should just consist of a single
extent. That could be a new attribute form, a size/extent attribute,
with the mtime assumed to be the same as the ctime, which I think would
cover a majority of files on the system I am using right now.
See, I am really obsessive when it comes to saving bits. None of these
compression hacks costs a lot of code or has a big chance of hiding a
bug, because the extra boundary conditions are strictly local.
> I think the cpu overhead is going to be a bit worse then you are
We shall see. Factors working in my favor are:
* Snapshots are explicit, so you have to create lots of them plus
heavily update the data in order to make the versioned data bushy.
* The linux page cache not only caches data but provides an efficient
access path to logical blocks that avoids having to consult the
filesystem metadata repeatedly under heavy traffic.
* Compact metadata entities including versioned attributes and extents
(pointers for now)
* The kind of load that people put on a Netapp is a trivially small
number of versions for Tux3.
> Are you going to use a B-Tree for the per-file layer or a blockmap?
Btree, which I call the file index btree.
> And how are you going to optimize the storage for small files?
Immediate data is allowed in an inode as a versioned attribute.
> Are you just going to leave them in the log and not push a B-Tree
It did not occur to me to leave them in the logical log, but yes that
is a nice optimization. They will only stay there for a while, until a
rollup is triggered, moving them into the inode table proper. It gets
a little tricky to figure out whether the amortization due to the
logical logging step is worth the extra write. There is a crossover
somewhere, maybe around a hundred bytes or so, ideal for symlinks and
acls.
> :I might eventually add some explicit cursor caching, but various
For rotating media, true. There is also flash, and Ramback...
> But for cached data sets, watch out! The cpu overhead of your B-Tree
I found that from my HTree experience. I found that pretty much the
entire cost of the bad old linear dirops was CPU and I was able to show
a ridiculous speedup by cutting the cost down to O(log511(ops)) from
O(ops^2).
> For large fully cached data sets not caching B-Tree cursors will
OK, you convinced me. I already have btree cursors (see ddsnap path[])
that are so far only used for per operation btree access and editing.
So these will eventually become cache objects.
> :> If I were to do radix compression I would also want to go with a
Do the variable sized page hack :-]
Or alternatively look at the XFS buffer cache shim layer, which
Christoph manhandled into kernel after years of XFS not being accepted
into mainline because of it.
> :> I'd love to do something like that. I think radix compression would
Let me see, for Tux3 that is 64 bytes for the inode part including a
tiny amount of data, plus 32 bytes or so for the dirent part, say 100
bytes. This is my reward for being an obsessive bit mizer.
> The bigger cost is when creating and managing a large file. A 1 gigabyte
With pointers it is about a megabyte per gigabyte for Tux3 too (and
Ext2 and Ext3) so going to extents is not optional. I plan a limit of
64 blocks per extent, cutting the metadata overhead down to 16K per
gigabyte, which is hard to complain about. Larger extents in the file
index would not buy much and would mess up the nice, 8 byte
one-size-fits-all versioned extent design.
> Adding a forward log to HAMMER is possible, I might do it just for
Waiting for me to prove the idea seems reasonable.
> :He did remind me (via time travel from year 2000) of some details I
That is nice, but since I want to run without a pruner I had better
stick with the logical logging plan. The link count of the inode will
also be committed as zero, or rather, the link attribute will be
entirely missing at that point, which allows for a cross check.
> The budding operations is very interesting to me...
I was thinking more about the operation of taking somebody's home
directory and turning it into a separate volume, for whatever reason.
Tux3 snapshots (versions) have that write/rollback ability, and also
have it for snapshots of snapshots, just in case that was not clear.
> Hmm. I could definitely manage more then one B-Tree if I wanted to.
Yes, it sounds nice.
> I can tell you've been thinking about Tux for a long time. If I
The lookup path is actually quite simple, I am far enough in to know
that. The versioning algorithms are the most complex piece from an
algorithmic point of view, and as you can see from the sample
implementation, they ha
| James Bottomley | Re: Integration of SCST in the mainstream Linux kernel |
| Greg Kroah-Hartman | [PATCH 007/196] Chinese: add translation of stable_kernel_rules.txt |
| david | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Jan Engelhardt | intel iommu (Re: -mm merge plans for 2.6.23) |
git: | |
| Alexey Dobriyan | Re: [GIT]: Networking |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| David Miller | Re: [BUG] New Kernel Bugs |
