Hi,
Shawn triggered some well needed thinking on my part about the notes
implementation. At the moment, we have flat directory structure, and read
all of them in one go (when needed).
I think we should support that, because it is relatively easy to generate
that kind of trees for small-scale applications.
However, I think there is also a benefit to handle fan-out directory
structures, too: they scale much nicer.
If the commit name was not found as a filename, it could be searched in
whatever subdirectory whose name is a prefix of said commit name (first
wins).
So I think it would be a sane plan to do the following when a commit note
is requested:
- If not done yet, read in the whole top-level directory of the notes ref.
- If the commit name is not found, find the tree entries whose name is a
prefix of the commit name (we can even use the same hashmap to store
these "incomplete" names, as we use a linear hash, which we fill in
ascending order),
- read the trees one by one, until the commit name is found (or no tree
entry is left), deleting the trees from the hashmap on the go.
How does that sound?
Ciao,
Dscho
--
Add 2 small lines to document that you can also use 'git rebase --interactive' to add new commits to the rebased patch-series. This is sort of running multiple 'git cherry-pick' commands in one go. Signed-off-by: Kjetil Barvik <barvik@broadpark.no> --- If this is a good idea, soneone should correct my poor english. -- kjetil git-rebase--interactive.sh | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh index 3dc659d..aa2e53c 100755 --- a/git-rebase--interactive.sh +++ b/git-rebase--interactive.sh @@ -734,6 +734,9 @@ first and then run 'git rebase --continue' again." # If you remove a line here THAT COMMIT WILL BE LOST. # However, if you remove everything, the rebase will be aborted. # +# If you add a new line with a command followed by the SHA1 ref of the +# patch, then that patch will be added to the list. +# EOF has_action "$TODO" || -- 1.6.1.349.g99fa5 --
Hi, If we start along those lines, we also have to add documentation how to - split commit, - change authorship of commits, - deal with commits where --cherry-pick did not detect that they were applied already, - deal with merge commits, and - possibly a lot more. I do not think that this is a good way to spend valuable screen estate; I think that is what the man page should cover. I only made an exception for the deletion of lines, as people were actively burnt. Yes, they should have read the man page. But the consequences of not doing so were bad. Ciao, Dscho --
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes: I agree! Ok, then the correct thing todo whould be to add some lines to the git-rebase.txt file in this case. -- kjetil --
Hi, That would be awesome! Ciao, Dscho --
So, something like a Trie data structure? I think that is a great way to=20 store fixed-length strings from a limited alphabet with arbitrary data=20 attached. =2D-=20 Boyd Stephen Smith Jr. ,=3D ,-_-. =3D. bss@iguanasuicide.net ((_/)o o(\_)) ICQ: 514984 YM/AIM: DaTwinkDaddy `-'(. .)`-' http://iguanasuicide.net/ \_/
I don't think a Trie quite makes sense here. We still have to look linearly through each git tree (an artifact of the tree implementation). You could organize the tree into a deeper, more complex data structure than just a simple fan-out. But remember that traditional data structures are usually trying to save expensive comparisons, and following a pointer is inexpensive. In the case of git trees, though, following a pointer into a subtree is _very_ expensive, since you have to lookup and decompress the object. So what we do now is read the tree into an associative hash. You could replace the hash with a trie, but it is not really the performance-critical part here. The issue is that without fan-out you have to read the _whole_ tree into the hash. With a constant-sized fanout, you get to divide that work by a constant. Or did you mean something else entirely? -Peff --
Perhaps it's not a traditional trie structure but that was the closest anal= ogy=20 I could come up with. I was actually thinking of something between a trie = and=20 a b-tree, I think. (It has been a long time since data structures class...) The issue, as I understand it, it that we don't have gargantuan tree object= s. =20 Reading and writing are slow and they'd also take up way to much memory if = you=20 are only trying to find a few commits. So, we figure out a maximum tree size that is reasonable, figure out a fan-= out=20 that prevents the tree from growing above that size, but *dynamically* appl= y=20 that fan-out. I.e. if the fanout is 2 characters, and we've added notes fo= r=20 both ff82730c and ff23abc0, then our tree would have ff/ -> some_tree_sha, = but=20 if we had only a note for the one one our tree would have ff82730c... ->=20 some_note_sha. Unlike .git/objects, we should probably also do dynamic fan= out=20 in subtrees. Yes, this would require a custom merge strategy for notes to flatten -> mer= ge=20 Yeah, that. While I'm throwing out crazy ideas, why not makes a notes tree look just li= ke=20 =2Egit/objects, including info and pack directories? =2D-=20 Boyd Stephen Smith Jr. ,=3D ,-_-. =3D. bss@iguanasuicide.net ((_/)o o(\_)) ICQ: 514984 YM/AIM: DaTwinkDaddy `-'(. .)`-' http://iguanasuicide.net/ \_/
That sounds unnecessarily complicated. It also really sucks for the case you want to optimize: small differences between trees, where you don't need to even linearize the common parts. Why not make it just a straight fixed 12-bit prefix, single-level trie. Sure, if you have less than 4k objects, it's going to add an unnecessary indirection, and close to an extra tree object for each object. But it should scale pretty well to a fairly huge numbe of notes. IOW, if you have less than 2^24 notes (16 million), you'll never have a tree object with more than 4k entries. And with each tree being ~70 bytes/object (40 bytes name, 20 bytes SHA1 + overhead), the individual tree objects will still be a reasonable(ish) size. And the fixed depth and prefix size means that merging is trivial and can use the normal tree merge that avoids touching common subtrees. The default .git/objects fan-out of just 8 bits might work too, but if we're thinking millions of notes (which is not entirely unreasonable), it gets ugly pretty fast. The reason it works ok for git is the repacking. Linus --
My solution suffers from that problem too, but I personally still don't think that the answer is to fix the trie boundary. The only case where it hurts is when you want to merge. Nothing else should care. So, if a merge of these note trees sees two different trie sizes then it can convert the shorter one to the longer length first, and then try the merge again. So you get the pain, but only once. And when a project decides that its split is too small, it can split then and it should "silently" spread out to others. Sam. --
But what's the advantage of the added complexity? The non-fixed trie only helps for the case that doesn't matter - just a few annotations. If you have a thousand annotations or less, you _really_ don't care. Whatever you do will be fine. So the whole thing only matters once you have tens of thousands of entries, and then you do want to have fan-out. No? Linus --
Yeah. I see your point and you may be right, that a 12/28 split hurts no-one, if we take this to the benchmarks. There's certainly savings in terms of total object count for the small users by using a smaller split. I just already wrote the code to handle an arbitrary split for the features written so far[1]. If *I* can write it, in C, it means it must not be that complicated ;-) So it comes down to how complicated things are when merging happens. If 12 is fixed in stone this is simple, because there are no chances for discrepancies. But refs/notes/commits still needs special treatment to be fetched, because it is not under refs/heads/* and you wouldn't normally have a working tree to resolve conflicts. So I think probably the most productive thing to do is for me to write the code to handle the merge as I described above, once the code to handle pulling in - and merging - notes at 'git fetch' time is written. Then we can see whether it's that much of a complication. To bench this we need the current builtin-log implementation to be re-written to be lazy. Which means we can't put it in the next release unless someone writes that. However my proposal means that we can release as we are and not care, and let some code - which I hope I have shown isn't *that* complicated, really - deal with it in a later release, and not break backwards compatibility. Sam. 1. see message <1233455960.17688.122.camel@maia.lan> --
Hi, I think I either missed your mail or had to ignore it due to too much day job work. It is a good first step, of course the next step would be to load the trees on-demand. Oh, and the best approach to handle the "to Trie or not to Trie" question would be to be strict in what we emit (12/28 it seems, by authority of Linus), and liberal in what we accept, IMHO. That is, accept whatever partition of the SHA-1, stopping on the first we found (smaller number of slashes, or when that is equal, the smaller first prefixes). We can always discuss ways to handle merging later, I guess. Ciao, Dscho --
Hmm. Do we really care about how easy it is to generate? Are we
expecting people to not use the command interface and instead check out
a notes tree and start putting stuff into $commit/foo?
And if we are encouraging the dual possibilities, how do we handle the
case of merging two trees with equivalent but differently-formatted
content?
Imagine I have three users, A, B, and C, all collaborating on a project
with notes. A and B use the "git notes" interface which generates a
fan-out directory structure. C uses his own script that directly writes
to the notes tree without fan-out.
Now let's imagine A, B, and C all write a note for commit X, and A pulls
from the other two. When he pulls from B, there is a file-level
conflict, and he decides that his note is better and resolves in his
favor. But when he pulls from C, there is _no_ conflict, and now there
are two notes for the same commit in his notes tree. You can give the
multiple notes some sane semantics (one trumps the other, or they are a
list, or whatever), but there is still an inconsistency: B's notes and
C's notes behave differently. So now A has to start caring about how
other people generate their notes.
The only two solutions I can think of are:
- when A pulls notes, he does a specialized merge that normalizes the
note trees
- particular notes trees are specified as being in "fan out" or "not
fan out" mode. But there is no place to specify that to enforce it.
-Peff
--
Hi, I wanted to be nice to existing users of the feature (it is in 'next', after all, and Thomas has produced some awesome examples, that will hopefully show the scalability of the thing). You're correct. This buys all kinds of trouble. Ciao, Dscho --
OK. I think if you are seeing performance benefits from a 2-character fanout, then we should standardize on that (do you have new performance numbers somewhere?). The notes implementation is now in master. If it's about to change in an incompatible way, how do you want to handle it? I'm wary of a quick patch to change the format this late in the release cycle. We could hold it back from 1.6.2. Alternatively, we could let it release with a "this is probably going to change" warning. One other thing to note: I think we discussed in the past other kinds of "more than one way to store it" strategies (e.g., letting a blob note be the same as a tree note containing a blob "default"). They suffer from some of the same issues (though not quite as badly, since you would at least see that there was a conflict). -Peff --
Hi, [Junio: seems like both Peff and me would like to hold the notes out of 1.6.2, would you mind?] The thing is: Shawn is correct when he says that a tree object to hold the notes of all commits (which is not an unlikely scenario if you are Actually, I do not see much of a problem there. If the entry (corresponding to the commit name) in the notes tree points to a blob, then that is that, if it points to a tree, then we just read all of the objects therein (or maybe at a later stage we allow restricting to a certain file basename). The point you raised earlier, that there would be a lot of ambiguity if we allow both flat and fan-out directory structures, is a valid point, though. Ciao, Dscho --
Sorry I'm getting involved in this notes thing so late. I was way too focused on Gerrit2 and just didn't pay much attention to what was on the git ML recently. Like Dscho and Peff, I think we may A notes tree entry requires 6+1+40+1+20=68 bytes per entry. If I use it for what I want in Gerrit, which is to annotate every commit, on a project like git.git with 17,491 commits we're talking about a tree that is 1.13 MB. That tree grows at a rate of 276 KB/year. I'm not sure I want to think about the cost to unpack that tree, just so I can look at "git log --since=1.week.ago". My fear here is that over time we will be spending a lot of CPU time unpacking and indexing the tree in memory, only to then pull out a handful of recent commits, and then see the pager abort and Yup. The flat vs. fan-out is a problem. In a slightly unrelated thread offlist I have been talking with Sam Vilain about using Git as a database backend for tuple storage. There is a related issue there about making the tree structure consistent, but never stored in a way that we wind up with these massive multi-megabyte objects. We've only started to kick it around, but I think we are both in agreement that a "database tree" is owned by the database code and must not be twiddled manually. Not unless you can honor the formatting rules. Just like you shouldn't use "git hash-object" to create a tree, unless you can honor the basic formatting rules for trees. This also means that the "database trees" probably are not going to be mergeable with a basic merge-recursive sort of algorithm, but instead need specialized handling to perform the combination. I think we're leaning in a direction of something more like this for trees: - Tuples are stored under a path constructed from their primary key. The analog here is, the commit SHA-1 the note is annotating. - Trees are capped at some reasonable size limit. For sake of argument lets call that MAX_TREE. My feeling is this would be closer ...
Hi, The whole point of my exercise was to reuse as much as possible of Git's framework. After all, if you store an index external from Git's object database, you go back to reimplementing the whole infrastructure for fetching/merging just for that index. Ciao, Dscho --
Yea, I know. It might just be easier to abandon everything in Git and start from scratch for the "git database" thing. But we'd lose the ability to at least piggyback onto the existing Git transport. And it doesn't help the "git notes" feature we're talking about. Maybe I was viewing the external index as like the working tree, where you can't really access the data until the external indexes are current, just like you can't really (easily anyway) access the working tree files until you bring the working tree current. But yea, it doesn't really use any of the existing machinary. -- Shawn. --
I think my patch from 1 Feb addressed this, at least for the operations it implemented. I just don't see why you need to decide up front what the split is going to be. Just read the next tree, descend into the closest matching tree until you find the record you are looking for and that's it. Sure, my patch just loads it all and throws it into a hash - this should still be efficient for short log operations even if the hash table ends up 1MB. But why take my guess. Let's stress test it. 'lorem' is the binary in the Text::Lorem Perl module. It generates a paragraph of random Latin text. wilber:~/src/git$ time git-log | wc -l 256072 real 0m0.709s user 0m0.608s sys 0m0.116s wilber:~/src/git$ git rev-list HEAD | wc -l 17678 wilber:~/src/git$ cat > my-editor #!/bin/sh ( lorem; echo ) > $1 wilber:~/src/git$ chmod +x my-editor wilber:~/src/git$ export EDITOR=`pwd`/my-editor wilber:~/src/git$ export GIT_NOTES_SPLIT=2 wilber:~/src/git$ time git-rev-list HEAD | while read rev > do ./git-notes.sh edit $rev; done fatal: unable to create '.git/refs/notes/commits.lock': File exists error: Ref refs/notes/commits is at 5f0732975b4acf237912a31e7ce14aa86d2e8179 but expected 725a2d119d2725e7d821906ad085bfbadbf43c8e fatal: Cannot lock the ref 'refs/notes/commits'. [...] fatal: unable to write new index file Could not read index fatal: unable to write new index file Could not read index fatal: unable to write new index file Could not read index fatal: unable to write new index file Could not read index fatal: unable to write new index file Could not read index real 76m16.927s user 43m55.909s sys 19m33.005s Oo. Nasty errors there but never mind that for now. Obviously some remaining issues in the shell script. What did I get out of that? wilber:~/src/git$ git-ls-tree -r refs/notes/commits | wc 12043 48172 1144085 wilber:~/src/git$ Hey well that's not too bad. Enough to be a good test. How long ...
Great idea! Glad I thought of it! ;-) http://thread.gmane.org/gmane.comp.version-control.git/106715/focus=107975 I hoped my approach allowed for smarter things later, such as splitting into smaller buckets whenever a directory gets more than N entries or periodically rebalancing if required. But the initial version is at least forward thinking to support reading it. Merging them will need to be savvy of this of course. Sam. --
