On Saturday 02 December 2006 22:22, Linus Torvalds wrote:
Thinking about this...
You have to make very sure to always update the caching layer containing
the backlinks on every addition of a further object. You can do this
because you always reached this new object by some other object, which
exactly is the backpointer.
Now let us suppose we are able to do this.
What does this give us?
Take a look at object traversal:
We have to store the flag "already visited" for objects we could reach
again in the traversal. But with the backlinks, we can see that most
of the objects can only be reached via one path, and therefore, there
is no need to store the flag, as it never will be queried in the
further traversal.
(Similar for objects with two paths: When you have visited the object
two times, you can throw away the flag, as it is not queried any more).
Regarding the caching layer and object traversal, it would have been
enough to only store "is this object reachable via more than 1 path?".
For this, the "cache" could be the set of objects reachable with
more than one path.
And such a set stored in a file should be quite managable, and be
quite small, relative to the size of the object database.
In fact, this "cache" can be created with a usual object traversal
(which has the original memory requirement), but as long as we do
not add objects to the database, further traversals would only need
a fraction of memory.
When only adding a small number of objects, it should be easy to
update the cache; while with big actions like fetching/pulling,
we simply should remove the file with the backlink information.
Again only some thoughts...
Pack files are fully self-contained object stores, yes?
So in the scope of a single pack file, the offset of this object is enough
as object identification.
If we could make sure that in any given algorithm touching objects, like
commit traversal, we always have the offset available when we need to do
an object lookup, then, it should be enough to store object flags only
indexed by the offset of this object in the pack.
The translation SHA1 -> offset can be done with the pack index.
As you usually have multiple packs, a (pack number / offset) tuple should
be enough as object ID.
Thinking even one step further:
Would it make sense to define an encoding format for the content of
commit and tree objects inside of packs, where the SHA1 is replaced by the
offset of the object in this pack?
As exactly the SHA1 is the least compressable thing, this could promise
quite a benefit.
AFAIK, we currently only use these offsets for referencing objects in
delta chains.
More about the original topic of this thread (and off-topic to the
new subject):
Without submodule identities, we would have to clone path-by-path, as
we can not distinguish different submodules apart from there location.
Josef
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html