Let's back up a little bit from "Caclulating tree node". What are the
elements of git's data structures?Right now we have an index structure (tree nodes) integrated in to a
base table. Integrating indexing into the data is not normally done in
a database. Doing a normalization analysis like this may expose flaws
in the way the data is structured. Of course we may also decide to
leave everything the way it is.What about the special status of a rename? In the current model we
effectively have three tables.commit - a set of all SHAs in the commit, previous commit, comment, author, etc
blob - a file, permissions, etc.
file names - name, SHAThe file name table is encoded as an index and it has been
intermingled with the commit table.Looking at this from a set theory angle brings up the question, do we
really have three tables and file names are an independent variable
from the blobs, or should file names be an attribute of the blob?How this gets structured in the db is an independent question about
how renames get detected on a commit. The current scheme for detecting
renames by comparing diffs is working fine. The question is, once we
detect a rename how should it be stored?Ignoring the performance impacts and looking at the problem from the
set theory view point, should:
the pathnames be in their own table with a row for each alias
the pathnames be stored as an attribute of the blobBoth of these are the same information, we're just looking at how
things are normalized.--
Jon Smirl
jonsmirl@gmail.com
-
There isn't a one-to-one mapping of file names to blobs. The blob only
describes the contents of the file. In the extreme case you could have
one blob for every single file in your tree. For example:# git ls-tree -r HEAD
100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c bar/foo
100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo
100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo2
100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo3
100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo4
100644 blob 05303ef858aeeb01ca40590dd6fe65928096ee6c foo5--
Julian---
"You shouldn't make my toaster angry."
-- Household security explained in "Johnny Quest"
-
Both schemes support aliasing. In the flat scheme you would create a
second blob which contains the file and the aliased path name. When
the blob gets delta'd the second copy of the file will disappear.I'm not proposing a change to data being stored in git, it is a
proposal to consider the impacts of how this data has been normalized--
Jon Smirl
jonsmirl@gmail.com
-
But to what end?
We all *know* the impacts:
* Excellent performance at what it does now.
* Currently zero capability to replace google as the #1 search engine.Since replacing google's db was never, and will never, be the goal of
git, what is it you wish to achieve? Seriously, I'm dying to know, so
please tell me. If you have already and I'm too daft to understand it,
humor me and reiterate :-)--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-
Another way of looking at the problem,
Let's build a full-text index for git. You put a string into the index
and it returns the SHAs of all the file nodes that contain the string.
How do I recover the path names of these SHAs?--
Jon Smirl
jonsmirl@gmail.com
-
That question does not make much sense without specifying "which
commit's path you are talking about".If you want to encode such "contextual information" in addition
to "contents", you could do so, but you essentially need to
record commit + pathname + mode bits + contents as "blob" and
hash that to come up with a name.
-
I left the details out of the full-text example to make it more
obvious that we can't recover the path names.Doing this type of analysis may point out that even more fields are
missing from the blob table such as commit id.The current data store design is not very flexible. Databases solved
the flexibility problem long ago. I'm just wondering if we should
steal some good ideas out of the database world and apply them to git.
Ten years from now we may have 100GB git databases and really wish we
had more flexible ways of querying them.The reason databases don't encode the fields into the index is that
you can only have a single index on the table if you do that.
Databases do sometimes duplicate the field in both the index and the
table. Databases also have the property that indexes are just a cache
and can be dropped at any time.--
Jon Smirl
jonsmirl@gmail.com
-
Databases solved the flexibility problem, at the cost of performance.
And if you use full normalized form in your database scheme, it costs
you even more in performance, because of all of the joins that you
need in order get the information you need to do, you know, useful
work as opposed to database wanking.If you take a look at the really big databases with super high
performance requirements, say like those used to managed airline
tickets/reservation/fares, you will find that they are not normalized,
and they are not relational; they can't afford to be. And if you take
a look at some of git competition that use relational databases to
store their SCM data, and take a look at how loooooong they they take
to do even basic operations, I would say that the onus is on you to
prove that normalization is actually a win in terms of real (not
theoretical) advantages, and that it doesn't cause performance to go
into the toilet.I think the fundamental disconnect here is that no one is buying your
claim that just because the data design is "more flexible" that this
is automatically a good thing in and of itself, and we should even for
a moment, "put performance aside".I also don't think that attempting to force git's data structures into
database terms makes sense; it is much closer to an filesystem using
an object based store --- and very few people except for folks like
Hans Resiers believes that Filesystems and Database should be
unified....- Ted
-
It is very easy to get bogged down in performance arguments on
database design when the correct answer is that there are always lots
of different ways to achieve the same goal. I wanted to defer debating
performance until we closely looked at the relationships between the
data at an abstract level.Since git hasn't stored all of the fields in the object table (the
path is encoded in the index) we are never going to be able to build
an alternative way of indexing the object table. Not being able to
build alternative indexes is likely to cause problems when the
database starts getting really big. Without an index every query that
can't use the path name index is reduced to doing full table scans.A few things that could benefit from alternative indexing, blame,
full-text search, automating the Maintainers file, etc.I'm just asking if we really want to make full table scans the only
possible way to implement these types of queries. If the answer is no,
then let's first explore how to fix things at an abstract level before
diving into the performance arguments.An obvious parallel from the file system world is the locate database
and how it is forced to continuously rescan the file system and store--
Jon Smirl
jonsmirl@gmail.com
-
But you cannot. Git is performance-critical, for the same reason every
other performance-critical application is: It's a tool to save human
time. Linux development *could* be done using patchfiles by the bundle
and masses of tarballs. It's just not the fastest way to do it, so enter
git, and lots of problems just go away. It's not the only way of doing
it, but it saves time. If you were to add 2 seconds to each commit,We can still build alternative indexes. They just have to be separate
from the DAG and the current indexing scheme. Junio has pointed outI've said it before; The most common delimiter used today is paths. It's
a behaviour git was designed to handle well, because it *is* the most
common way of limiting and separating content. It's not some random
fluke that has made git perform very well on actions that commonly
performed in large scale software projects; Linus designed it that wayYes, but getting rid of the tree objects and storing pathnames in
blob objects would penalize log-viewing, diffs and merges, which
are far more common operations than full-text searches in a softwarePersonally, I really don't care. But you should really have read Junio's
mail a bit more carefully. He explained about 'notes' that can be attached
to commits and contain arbitrary data. By all means, create your indexes
there and use them for whatever you like, but leave the foundation on which
git was built *alone*. The design hasn't changed since April 2006 (subtrees
were introduced April 26, I think), because it's a *good* design.--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231-
This is why I wanted to separate the abstract data structure design
discussion from the performance one. In the flat design indexes are
like caches and can be created and destroyed. There will definitely be
an index created on the the paths. This index will work like the
current tree nodes. The difference is that this index is a cache
unlike the current tree nodes which are an immutable part of the the
data base.The path name field needs to be moved back into the blobs to support
alternative indexes. For example I want an index on the Signed-off-by
field. I use this index to give me the SHAs for the blobs
Signed-off-by a particular person. In the current design I have no way
of recovering the path name for these blobs other than a brute force--
Jon Smirl
jonsmirl@gmail.com
-
Erm, if that's your only way then you designed your index incorrectly.
1. Signed-Off-By lines appear in commits, so your index should be an index
of SOB name against commit hash
2. Lookup the commit for that commit hash. As usual this is blindlingly
git-fastic.
3. That commit blob contains a tree hash. Look it up. As usual this is
blindingly git-fastic
4. Start gathering blobs for that tree. Fast, fast, fast.
5. Any subtree objects you come across, goto 4.This is not a brute force lookup and it's stuff that git is really good at
anyway.I'm really not sure I see what problem you're trying to solve. Whatever
index you want, you could keep and maintain if you wanted to without
impacting git's core storage at all.Andy
--
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com
-
Ah, there we go. A use-case at last :)
So now we have a concrete problem that we can formulate thus:
"How can one create a database listing the relationship between 'signers'
and blobs?"So the second question: Do you seriously argue that git should take a
huge performance loss on its common operations to accommodate a need that
I suspect very few people have?--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-
Why do you keep jumping to a performance loss? Both schemes will have
an index based on paths. The problem is how those indexes are
constructed, not the existence of the index. Moving the paths into the
blobs in no way prevents you from creating an index on that field.The problem is that the SHAs have been intertwined with the tree
nodes. This blending has made it impossible to create other indexes on
the blobs.The path index in the flat scheme will probably look just like tree
nodes do today but these new tree nodes won't be intertwined with the--
Jon Smirl
jonsmirl@gmail.com
-
But not a brilliant one. You sign off on commits not blobs. So you go
from the sign-off to paths, then to blobs. There is no need to go fromBut moving the path into the blob _IS_ the perfomance hit. You lose the
ability to tell the two files have the same content _without even looking
at the blob_. This is one of the core parts of making git operations
blindingly fast. You can't throw that out, and then say that there is no
performance hit.You keep talking about abstract database performance - but git is not an
abstract database. It has very specific common usage patterns, and isAnd you will have to prove that diff/merge etc. don't become very much
slower before you get buy in.--
Julian---
Many receive advice, few profit by it.
-- Publilius Syrus
-
Use blame for an example. Blame has to crawl every commit to see if it
touched the file. It keeps doing this until it figures out the last
author for every line in the file. Worse case blame has to crawl every--
Jon Smirl
jonsmirl@gmail.com
-
Sure. Build a quick dedicated index for that and measure
- cost (size and commit/fetch costs)
- benefit
- frequency of usagegit is a special-purpouse DB that does great for certain access
patterns. Have a look at monotone for a design that looks a lot like
git but is backed by a general purpouse DB and does equally poorly forYep. Can we get a minimal-cost index with just enough hints that can
speed up blame, and perhaps git log with/very/deep/path? Probably!That's worth pursuing sure.
martin
-
Hi,
But you can add _yet another_ index to it, which can be generated on the
fly, so that Git only has to generate the information once, and then reuse
it later. As a benefit of this method, the underlying well-tested
structure needs no change at all.BTW could you please, please, please cut the quoted message that you are
_not_ responding to? It really _wastes_ my time.Ciao,
Dscho
-
And in fact, you can do this today, without modifying git-blame at all,
by (ab)using its "-S" option (which lets you specify a custom ancestry
chain to search). By coincidence, I was just showing some people at my
office how to do this yesterday. I'll cut-and-paste from the email I
sent them. I am not claiming this is nearly as desirable as a built-in,
auto-updated secondary index, but it proves the concept, anyway.Fast-to-generate version:
git-rev-list HEAD -- main.c | awk '{if (last) print last " " $0;
last=$0;}' > /tmp/revlistThis speeds things up a lot, because git blame doesn't have to examine
other revisions:time git blame main.c
1.56s user 0.30s system 99% cpu 1.868 total
time git blame -S /tmp/revlist main.c
0.21s user 0.03s system 96% cpu 0.249 totalThe bad news is that generating that revision list is a bit slow, and if
you do it the naive way I suggested above, you can't use the rev list
with the -M option (to follow renames). The good news is that it's
possible to have that too if you generate a list of revisions that
includes the renames:# Generate a list of all revisions in the right order (only need to do
this once, not once per file)
git rev-list HEAD > /tmp/all-revs
# Generate a list of the revisions that touched this file, following
copies/renames.
# Could do this in fewer commands but this is hopefully easier to follow.
git blame --porcelain -M main.c | \
egrep '^[0-9a-f]{40}' | \
cut -d' ' -f1 | \
fgrep -f - /tmp/all-revs | \
awk '{if (last) print last " " $0; last=$0;}' > /tmp/revlistThen -M is fast too:
time git blame -M main.c
1.72s user 0.27s system 89% cpu 2.219 total
time git blame -M -S /tmp/revlist main.c
0.29s user 0.03s system 93% cpu 0.341 totalOddly, if you use the -S option, "git blame -C" actually gets
significantly *slower*. I am not sure why.-Steve
-
Estimated daily uses of git-blame, world-wide: few
Estimated daily uses of git-{merge,diff}, worldwide: lotsCode and benchmarks, or I'm not buying it.
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-
Which makes the author of git-blame weep X-<.
The real issue is that embedding pathname in blob does _not_
help "git blame" but would actively hurt it. A file with the
identical contents moved between the parent to child commit
shares the same blob object and same object name in the real
git. Jon's modified system that hashes pathname together with
the contents would have them as two completely unrelated objects
with different object names, which only means that even 100%
similarity rename case becomes as expensive to find as renames
of lower similarity, which needs to expand and look into blob
contents.
-
El 6/9/2007, a las 11:09, Junio C Hamano escribi
And why exactly would you need to change blobs to contain path for blame
to be faster ?Or more generally, what, in the current way of git doing things,
prevents you from adding an index to $THE_DATA_YOU_LIKE, exactly ?From the very few use cases you've given, I see nothing preventing to
create an additional index from the data git currently uses.Mike
-
And this is advantaged by having the path in the blob how? The important
information here is knowing which commits touched the file - this
information is expensive in git because it is snapshot based. You have to
go back through all the commits looking for changes to the given path.
The information you might want to cache is which commits touched the file,
which you could do without changing the current data storage. Presumably
you are suggesting that such a cache would be cleaner with the filename in
the blob? Or do you think that it would somehow be faster to create? If
so, how?--
Julian---
Humor in the Court:
Q: (Showing man picture.) That's you?
A: Yes, sir.
Q: And you were present when the picture was taken, right?
-
The only possible reason I can think of for moving data into the blob
would be to make a POSIX-compliant git-like filesystem, and EVEN THEN
you would NOT move the path out of the tree objects. In order to
have somewhat consistent inodes (and also for performance when
changing 4 bytes in a 40GB file) you would want to have 3 different
types of "inode" objects:1) 4-64k of (metadata + filedata)
2) 4-64k of (metadata + list of 4-64k filedata blobs)
3) 4-64k of (metadata + list of 4-64k lists of filedata blobs)On the other hand... that isn't GIT, it's something completely
different with a very different usage pattern and set of
requirements. And you still don't put the path name in the objects,
just the permissions and other attributes/metadata.<Random Thought Experiment>
You would of course want to better define those 4-64k limits for
allocation and performance reasons, but a double-indirect table of
SHA128s with 64kb chunks lets you address up to 1TB of file data, and
for each additional power-of-two increase in the chunk size you get 8
times the storage space. Furthermore, the actual double-indirect
tables for an 8TB file using 128k chunks would be all of 64MB, for a
more reasonable 4GB file with 32k tables (max of 128GB) it would be
maybe 128kB of indirect SHA1 hash tables.
</Random Thought Experiment>Cheers,
Kyle Moffett-
Quite the contrary. You just illustrated why it is wrong to put
anything but contents in the blob.The specialized indexing is a different issue. If you want to
have a full text index to answer "what paths in which commits
had this string?", then your database table would have columns
such as commit (sha-1), path (string) as values, indexed with
the search string.Now the current set of "git" operation does not need to answer
that query, so we do not build nor maintain such an index that
nobody uses. But your application may benefit from such an
index, and as others said, nobody prevents you from building
one.
-
The big difference between a database and git is that a database is a
general purpose tool. git has a much more restricted scope. As such, it
doesn't need *that much* flexibility.Mike
-
Databases are designed to be efficient at storing and accessing large
amounts of data. The key thing about a database is that it does not
track the *history* of the data it is storing. This is the main
problem with using a database as a metadata storage facility.Modern source control systems such as Perforce (and possibly
Subversion), use a database to track metadata such as branch/merge
history, user data and so on. This, IMHO is a huge weakness of these
SCM systems. It is impossible to fully roll back to a given point in
time, because that metadata is stored independently of the file
content tracking.Git *is not a database*. This is fundamental to understanding how git
works. Git stores *all* of its data in a Directed Acyclic Graph (with
the exception of the pointers to tag and the current head of each
branch, that it stores locally in the .git directory). Read
http://eagain.net/articles/git-for-computer-scientists/ for more
information on this.What this means is that for any commit, git has all the information it
needs about the repository at that point in time. It doesn't need
anything else. If you then store information in a database, you lose
having the complete picture at any point in the history of the
repository.- Reece
-
I wouldn't know, but presumably any table can have more than one column.
Is this a problem you face with git so often that it requires a complete
re-design of its very core?--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-
That's the whole point. We need to discuss the impact of merging a
field (path names) with an index (tree nodes) has on future things we
may want to do with the data stored in git.Databases don't usually blend fields/indexes without also duplicating
the field in the table. You need all the fields in the table so that--
Jon Smirl
jonsmirl@gmail.com
-
Yes, but as nobody seems to know what those future things are, it feels
rather pointless speculating about adding support to git for them. git
is a tool. It's a great one at that, because it was built to solve a
particular problem, which it does an amazing job at.Other SCM's which had the potential to become amazingly good tools too
drowned somewhere between prototype and product in a sea of intellectual
masturbation, which had little to do with solving real-world problems.--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-
commit - SHA1 of its parent(s) and its root-tree, along with
author info and a free-form field
blob - content addressable by *multiple trees*
file names - List of path-names inside a tree object.To draw some sort of relationship model here, you'd have
commit 1<->M roottree
tree M<->M tree
tree M<->M blobAssuming SHA1 never collides (collisions rule out any form of storage,
so we might as well hope it never happens), that leaves us with this:Each root tree can only ever belong to a single commit, unless you
intentionally force git to make completely empty commits. git
won't complain about this, so long as you don't make two in the
same second, because it relies more heavily on the DAG than on
developer sanity.Each root tree can point to multiple sub-trees. The sub-trees can be
linked to any number of root-trees.Blobs can be linked to any number of tree objects, or even multiple
times to the same tree object. This wouldn't be possible if theFile names are not independant variables. They belong inside the
Do you realize that you're contradicting yourself in two upon each
other following sentences here?Detecting renames after the fashion works fine. Not storing them
Except that
git init
echo foo > a
cp -a a b
git add .
git commit -m testing
git count-objectsyields 3 objects at the moment; A commit-object, a tree object and *one*
blob object. With your scheme the 2 blob objects would differ, and there
would be 4 of them. If you propose to ignore the path-name you have
effectively broken support for having two identical files with different
names in the same directory.Now, can you please tell me what gains you're hoping to see with this
new layout of yours?--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-
This actually can happen without even using 'ours' strategy.
If two people independently applied the same patch on their
branches and later their results were merged. And "the same
second" requirement is not even there and not interesting.
There are other things like developer identity, log message, and
their ancestry that would make the resulting commit objectMore importantly, in git, filenames and modes are not considered
part of "contents", which git tracks. Although it is an
entirely possible and valid alternate design to move that as
part of "blob" to build a system that is different from git,
which Jon seems to be aiming at, the benefit of such a design is
unclear to me, both from theoretical point of view (now blobs
are not about pure contents anymore) nor performance point of
view (Linus's done flat tree object in an early stage of git,
and it was not nice) as other people explained.
-
By introducing tree nodes you have blended a specific indexing scheme
into the data. There are many other ways the path names could be
indexed hash tables, binary trees, etc.This problem exists in files systems. Since the path names have been
encoded into the directory structures there is no way to query
something like "all files created yesterday" from a file system
without building another mapping table or a brute force search. I keep
using Google as an example, Google is indexing hierarchical URLs but
they do not use a hierarchical index to do it.Databases keep the knowledge of how things are indexed out of the
data. A data structure analysis of git should remove the blended index--
Jon Smirl
jonsmirl@gmail.com
-
It might help the discussion if you could point to a reference,
preferably one that discusses the trade-offs in the design, with more
concrete details about what google or other search engines actually
do. It would be particularly useful if it addressed issues of1. the type of queries the representation is optimised for.
2. consistency requirements. (Can a search engine use different data
structures if they improve average performance at the cost of
occasional inconsistency/lossage?)Finally, this design space is not totally unexplored, for example,
http://plan9.bell-labs.com/sys/doc/venti/venti.html
AFAICS they only use SHA-1 for blocks within files (although this
might be misreading the paper) so presumably they'd have knowledge
about the trade-offs.--
cheers, dave tweed__________________________
david.tweed@gmail.com
Rm 124, School of Systems Engineering, University of Reading.
"we had no idea that when we added templates we were adding a Turing-
complete compile-time language." -- C++ standardisation committee
-
That is correct. However, given that indexing scheme, many of the common
operations just "fall out" simply and efficiently, without the need to
keep separate indices. So yes, git is geared towards a particular set of
operations.Your complaint seems to be two-fold:
1. there is an inelegance in the blending of data and indexing. The
problem with changing this is:
a. we are all already using git, and it would require completely
re-vamping the core data structure
b. there is some feeling that the blending is necessary for
performance. Given the difficulty of (a), I think you would
have to provide compelling evidence (i.e., numbers) that a
git-like system based around set theory with separate indices
would perform as well.2. you want perform some operations to which the hierarchy is not
well-suited. In this case, I think you can get by with the same
solution you have proposed already: indices external to the data
structure (in fact, this is exactly what Google is doing: taking
hierarchical URLs and indexing them in different ways).Have you taken a look at the pack v4 work by Shawn and Nicolas? It
is an attempt to build such indices at pack time (but keeping the
core git data structure intact).-Peff
-
Pathnames are by far the most common search-/delimiting criteria for
Why? This is the core of the problem, really. You haven't specified a
single, real-life reason *why* it should be any other way than it
already is. It sounds a bit to me as if you've been to a really
inspiring seminar about "how database-like things *should* be done"
and then decided to go berserk on your favourite database-like thing,
which is git.Code and benchmarks or bust. In the meantime, I'll settle for a recount
of what problems you're having with the current layout, or what gains
you're hoping to achieve with the new one. As it's the 3rd time I'm
asking, this'll be the last.--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-
Actually, you don't need to be insane to have multiple commits pointing
at the same root tree. It is actually very easy:
- git clone
- do some stuff on your master branch and commit
- send your changes upstream
- upstream applies as is
- git pullYou now have everything merged, and the last commit on your master branch,
while being a different commit object due to its parenting, has the same
root tree as the tip of the remote branch.Mike
-
That explains why it felt so awkward writing that sentence. :)
Thanks for correcting me. Even so, one more M<->M relation-ship
certainly speaks for rather than against the current model.--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| Mike Travis | [RFC 00/15] x86_64: Optimize percpu accesses |
| Dave Jones | agp / cpufreq. |
| Willy Tarreau | Re: [PATCH] tcp: splice as many packets as possible at once |
| Gerrit Renker | [PATCH 14/37] dccp: Tidy up setsockopt calls |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Natalie Protasevich | [BUG] New Kernel Bugs |
git: | |
