Roland Stigge recently pointed out a use case using very large files where GIT has some serious limitations. He is one of several Debian developers keeping their homedir under version control with SVN (blame Joey Hess for this - http://www.kitenet.net/~joey/svnhome.html ). SVN does reasonably well tracking his >1GB mbox file. Now, I don't know if I like the idea of putting my own mbox file under version control, but it looks like projects with large and slow-changing files would be in trouble with GIT. Not literally trouble, but gross inefficiencies. The problems are two. At commit time, a full copy is stored in the object database until git-repack && git-prune-packed are called. And during the transfer over the git protocol we send the full object, even if both ends have objects that are good candidates for a small delta. I'm not strong on either aspect of git (packfile format or git protocol), and I don't personally deal with large files. So feel free to ignore me for the time being. If it ever itches, you might get a patch... cheers, martin -
To my surprise, it's not that bad. The Debian testing-security team uses a single 1.8 MB file (400 KB compressed) to keep vulnerability data. Most changes to that file involve just a few lines. But even in this extreme case, git doesn't compare too badly against Subversion if you pack regularly (but not too often). Disk usage is actually *below* Subversion FSFS even with --depth=10 (the default, unfortunately a bit hard to override). I plan to do another experiment for GCC, which contains marvels such as: 35905 126056 1379093 gcc/ChangeLog-2005 12610 61215 417584 gcc/combine.c But the outcome will likely be quite similar to the secure-testing case: comparable disk space usage, not a difference in the order of one or more magnitudes. But Subversion still has got a significant adventage: I can get a working copy without downloading full history (several gigabytes in GCC's case). There's also the slight drawback that you shouldn't pack too often, otherwise you'll reduce its effectiveness. You can always run "git-repack -a -d", but it's rather expensive. This means that you need to keep compressed fulltexts from a few dozen revisions, but I don't think this is a huge burden. All in all, the compressed fulltexts/packs model is a pretty good trade-off between disk usage, end user usability nad code complexity. In your mbox case, you should simply try Maildir. The tree object (which lists all files in the Maildir folder) will still be rather large (about 40 to 50 bytes per message stored), though. -
I have a project that has 2.5Mb files, and git handles them just fine, even on my old slow laptop. But when I tried to use it to backup my old email archive a few months ago, running about 2Gb in about 300 files, it took forever. Luckily I only archive stuff off every other month or so, otherwise it would be unusable. However, I did notice one problem. When cloning from one machine to another, for a project that is already fully packed, it seems that the whole project is packed again before sending it accross the wire. With an archive this big, that takes over an hour for my slow old fileserver. I ended up just rsyncing over the files and pointing the parent back to the original. Is there anyway to not repack everything if it's not needed? thanks, greg k-h -
I often cg-clone large projects over stupid protocols (http/rsync) and then hand-edit the branches/origin or remotes/origin file to say "git://" instead. It's called cheating but it works great ;-) m -
I did suggest maildir, where GIT is bound to do well as the content of the emails doesn't change but they just move around a lot. Though yes, trees are going to be nasty. But the interesting case I gues is the general one of large files changing slowly. My guess is that supporting delta transfers in the git protocol would make it a lot more manageable. For local storage git isn't so bad, and the problem is perhaps harder to resolve. cheers, martin -
I've been keeping maildir in git for a few months, with mail being delivered into a git repo on one (permanently connected) host and me merging that branch into a repo on my laptop for reading (the intention being that I should be able to sync it back to the permanently connected host as I sometimes read mail there. Alas, the merge part of this absolutely sucks -- as time goes by, its getting slower and slower (its taking an hour or so to do the merge, which has got to the point of being barely usable -- if it wasn't for the neat hack-value, I'd have given up on this by now). I haven't really probed whats happening when I'm doing the merges in any depth, but I see a lot of index manipulation happening (git update-index, I think) to add and remove files where each invocation of that seems to be taking almost a whole second. I wonder if the present merge algorithms perform especially badly in the case of a large number of files with lots of renames (and so lots of adds/removes) but no content changes? The merge should be able to happen entirely in the index, I think. Perhaps one way to proceed would be for me to write a move-optimised merge strategy where I flip the index round and instead of saying "how has the content inside this filename changed?" I instead say "how has the filename associated with this content <hash> changed?" A special-case on top of a move-optimised merge might be some maildir-aware filename handling that knows how to resolve conflicts when a particular content-hash has been renamed to two different names (eg. when one flag is added to a message in one repo and a different flag is added to a message in another repo). Any advice/thoughts/suggestions-that-this-is-a-stupid-thing-to-do would be greatly appreciated. -- -
How many files are you talking about? m -
If it takes an hour per merge, it _is_ unusable. I consider 15 _seconds_ to be pretty unusable. Can you do a git-ls-tree -r -t HEAD git-ls-tree -r -t HEAD^1 git-ls-tree -r -t HEAD^2 after a merge, and put the three resulting files up somewhere public (I assume the filenames aren't going to be anything private, I don't know how maildir organizes stuff) so that people can get an idea of what ends up being involved there.. Linus -
Btw, one thing to realize is that git is inherently a lot better at handling lots of files in _subdirectories_, especially if those subdirectories don't change. I've never used maildir layout, but if it is a couple of large _flat_ subdirectories, git will potentially handle that a lot worse than if you have a hierarchy of directories. I say "potentially", because if the directories are all mutable and change, then the flat approach is better. But if they tend to have some kind of stability, a lot of git operations (diffing and merging in particular) are able to see that two subdirectories are 100% equal, and will entirely skip them. This is a large part of why git performs well on the kernel. Most merges don't actually touch all - or even a very big percentage - of the over thousand subdirectories in the kernel. Git can quickly see and ignore the whole subdirectory when that happens - the SHA1 is exactly the same, so git knows that every file under that subdirectory (and every recursive directory) is the same. In contrast, if you have a million files in one directory, and 10 of them changed, git will still have to check the SHA1's for matches for the other 999,990 files. Which is going to be slow. That said, I suspect there's space for optimization. Linus -
That's what it is :/ One directory per mail folder, with each email an individual file in that dir. On the side I run a mini-ISP that offers (among other things) SMTP/IMAP/POP via exim/dovecot. I use maildir for my customers, and am desperately searching for a better solution. David Woodhouse and I talk off and on about using git for distributed mail storage. OTOH, I wonder if a P2P-ish, sha1-indexed network service wouldn't be better for both a git fallback and email storage. Jeff -
Ok. Anyway, I double-checked, and I'm wrong anyway. While the "static directories" thing is a huge performance optimization for doing many things (diffing trees, file history in git-rev-list, etc etc), for merging it doesn't help. We always end up expanding the whole tree. Which is kind of sad. It's inevitable in one sense: we do the merge in the index, after all, and the index - unlike the tree structures - is a flat file (like the "manifest" in mercurial or monotone). It's also represented that way in memory. However, it is a total and complete waste in other cases. Thinking more about it, this is also why merging causes all the horrible index performance: not only do we (unnecessarily) read the same trees over and over again only to collapse them back to stage0 later when they are the same, but because we keep the index in a linear format, when we read the other trees, we'll have to move things around with memmove() (just the pointers, but still). We'd actually be a _lot_ better off if we split "git-read-tree" up into two phases: one that did the recursive tree operation (which can optimize the "same tree everywhere" case), and the second stage that actually populated the index. I'll have to think about this. It would be an absolutely _huge_ optimization for merging in certain patterns, it just doesn't matter for something like the kernel with "just" 18,000 files and not a lot of strange merging going on. In contrast, I can see a mail archive easily having hundreds of thousands of individual emails. At which time it's horribly stupid to read them all in three times (for a merge - base, origin, new) and do so in a pretty inefficient manner. Ho humm. It doesn't look _hard_ per se, and I think the two-stage git-read-tree is actually also what the recursive merge strategy wants anyway (it can't use the index - it really just wants to get a list of conflict information). So this definitely sounds like the RightThing(tm) to do ...
and, named to include a hash of the contents so that they are always unique within a multi-folder mail store (makes refile easier). --=20 keith.packard@intel.com
really? I thought it was something like md5(hostname+datestamp+random). cheers, m -
Final note: this means, for example, that git is relatively bad at tracking a "hashed" nested file directory (like the one git itself uses), because new files will end up randomly appearing in every directory, and no directory is ever "stable". In contrast, if the directory structure is - for example - something where you index files by date, and subdirectories with older dates are thus much more naturally likely to be quiescent, the "this tree is the same" optimizations work very well. Basically, a lot of the git speed optimizations depend on "on average, things stay the same". We may have 18,000+ files in the kernel, but most patches will change maybe five of them. There's a lot of fairly static content and the changes have a certain level of "locality". It's normally a hundred-line patch to one file, not a hundred files that had one-liners. And when 20 files are changed, most of them tend to be in the same subdirectory, etc etc. Taking advantage of those kinds of things is what makes git good at handling software projects. But it wouldn't necessarily be how you lay out a mail directory, for example. An automated file store might want to spread out the changes on purpose. Linus -
Indeed... Im curious as to why anyone would want to use a SCM tool on a mail dir anyway - surely no-one edits their pasnt mails and needs to keep logs? surely incremental backups would be a better way to manage something like this ? -
Well, the maildir arrangement is friendlier to something content-smart like git than to something content-stupid like a backup tool. Files inside a maildir change name/location to reflect changes in status, but their content tends to remain the same. git does great in this scenario, except for the "dealing with a bazillion files" part of it. cheers, martin -
Hi, Point is, if you want to read your email on different computers (like one desktop and one laptop), you are quite well off managing them with git. Of course, you could rsync them from/to the other computer. But rsync is slow once you accumulated enough files, since it has to compare the hashes of tons of files (or file chunks). Git knows if they have changed. Hth, Dscho -
Yes. I actually think that git would be a _wonderful_ email tracking tool, but that may not mean that it's a wonderful tool for tracking all particular email layout possibilities. It clearly is _not_ a wonderful tool for tracking mbox-style email setups, for example ;) I suspect we actually could make the "one linear directory" setup perform pretty well. It wouldn't be the best possible layout (by far), but I think our problems there are just because of some decisions we've (me, mostly) made that didn't take that layout into account. I don't think the problems are in any way fundamental. That said, I think git could do much better if the layout was optimized for git. For example, in the maildir thing, there's two issues: the flat directory structure is sub-optimal, but the other thing seems to be that maildir apparently saves metadata in the filename. Saving meta-data in the filename should actually work wonderfully well with git, but both merging and git-diff-tree consider the filename to be the "index", so they optimize for that. You could do indexing the other way around, and consider the contents to be the index (and the filename is the "status"), but that's obviously not sane for a sw project, even if it might be exactly what you want to do for mail handling. Linus -
This seems to me to be another use case where git could gain orders of magnitude speed improvement by either explicit ("forensic") history objects, or a history analysis cache. Sam. -
Well, the thing is, it could get that _without_ any cache too. The problem really isn't that we couldn't make things faster, the problem is that at least for _me_ the thing is fast enough. If somebody is interested in making the "lots of filename changes" case go fast, I'd be more than happy to walk them through what they'd need to change. I'm just not horribly motivated to do it myself. Hint, hint. Linus -
In case anybody is wondering, I share the same feeling. I cannot say I'd be "more than happy to" clean up potential breakages during the development of such changes, but if the change eventually would help certain use cases, I can be persuaded to help debugging such a mess ;-). -
Excellent. Any speculations on where they might fit? Clearly, it needs
to be out of the "tree".
Dealing with the three cases I mentioned before in my Warnocked post;
1. caching - I'll consider this an "under the hood" thing, it really
doesn't matter, so long as the tools all know.
2. forensic - extra stuff at the end of the commit object?
eg
Copied: /new/path from /old/path:commit:c0bb171d..
(for SVN case where history matters)
Copied: /new/path from blob:b10b1d..
(for general pre-caching case)
Merged: /new/path from /old/path:commit:C0bb171d..
(for an SVK clone, so we know that subsequent merges on
/new/path need only merge from /old/path starting at commit
C0bb171d..)
3. retrospective - as above, but allow to specify old versions.
eg
Copied: /new/path:C0bb171d1 from /old/path:commit:c0bb171d2...
(for SVN case where history matters)
Martin, is that enough for your CVS case?
Sam.
-
(except "extra at the end of commit", which does not make it out
I am not sure if recording the bare SVN ``copied'' is very
useful. You would need to infer things from what SVN did to
tell if the copy is a tree copy inside a project (e.g. cp -r
i386 x86_64), tagging (e.g. svn-cp rHEAD trunk tags/v1.2), or
branching, wouldn't you? SVK merge ticket is a bit more useful
in that sense.
So far, git philosophy is to record things you _know_ about and
defer such guesswork to the future, so limiting what you record
to what you can actually see from the foreign SCM would be more
in line with it. For the same reason, if you are talking about
maildir managed under git, you should not have record anything
other than what git already records: "we used to have these
files, now we have these instead".
But I thought you were talking about caching what earlier
inference declared what happened, so that you do not have to do
the same inference every time. If that is the case, SVN level
"Copied:" is probably not what you would want to record, I
suspect. You would do some inference with the given information
("SVN says it copied this tree to that tree, what was it that it
really wanted to do? Was it a copy, or was it to create a
branch which was implemented as a copy?"), and record that,
hoping that information would help your other operations this
time and later.
So I think the order of questions you should be asking is:
- what operations are you trying to help?
- what information you would need to achieve those operations
better?
- among the second one, what will be necessary to be set in
stone (IOW, cannot be computed later), and what are
computable but expensive to recompute every time?
An example from an ancient thread.
With criss-cross merge between renamed trees, it was conjectured
that recording renames detected earlier would help later merges.
I think you should arrive at the list of "what we should record"
by thinking things in this ...Primarily, tracing history when dealing with history/changeset based
revision systems like SVN or darcs, and doing this in a manner that we
can make guarantees about behaving in the same way as those systems
Minimally, this tuple:
( merge|copy, source_path, source_tree|source_commit,
destination_path, destination_commit )
It makes sense to record this with commits, as conceptually it is a part
The only operation you cannot automatically and with certainty detect a
rename and change content without inserting a dummy commit between the
name change and the content change. But in a sense this is the same as
my suggestion - using the commit object history to record information
that normally doesn't matter when you are doing content-keyed
Well, thank you for spending so much time to reply to me given that was
your assessment. I think the best direction from here would be to start
molding some porcelain, then I can cross this bridge when I come to it
rather than simply speculating and hand-waving.
Besides, I can always prototype it for discussion using the commit
description as a surrogate container for the information.
Sam.
ps I also responded to the rest of your e-mail, but decided that the
answers to the above questions were more important.
>> 2. forensic - extra stuff at the end of the commit object?
> (except "extra at the end of commit", which does not make it out
> of the tree).
It is a part of the repository, but more a property of the commit itself
- like the commit description. Like somebody writing "I renamed this
file to that file and changed its contents", but in a parsable form
that can _optionally_ be used to prevent the relevant git-core tools
from having to do content comparison, or perhaps something subtler like
increasing the score of the recorded history branch when scoring
alternatives looking for history.
>> eg
>> Copied: /new/path from /old/path:commit:c0bb171d..
>> (for SVN case where history ...I think Junio & Linus are talking about alternative mergers, something that can be called instead of git-read-tree -m (which is the way merges seem to kick off). Or perhaps an additional flag to git-read-tree to be used in conjunction with -m, something like --optimize-for-identity that lets git-read-tree know to do a first pass keying things on file identity rather than file path. So we are _not_ touching the object database, at all. Only optimising merges for very large trees there mv is a popular operation. All the cases you discuss can be tackled very efficiently without making *any* Oh, I don't need it at all. It's just that there's been some lazy talk of tracking mboxes and maildirs with git, and look where it's led. Blame Roland Stigge who got me started down this track. I'm sure it's because the other optimisations are a lot harder to tackle ;-) though Linus mentions that it'd be trivial for git-read-tree -m to detect unchanged directories and perhaps do things a bit faster. Not as revolutionary as an --optimize-for-identity but not as risky either. In any case, don't count in me for any of this git-checkout hacking. I know better than start learning C posting patches to *this* list. cheers, martin -
Actually, I got interested in seeing how hard this is, and wrote a simple first cut at doing a tree-optimized merger. Let me shout a bit first: THIS IS WORKING CODE, BUT BE CAREFUL: IT'S A TECHNOLOGY DEMONSTRATION RATHER THAN THE FINAL PRODUCT! With that out of the way, let me descibe what this does (and then describe the missing parts). This is basically a three-way merge that works entirely on the "tree" level, rather than on the index. A lot of the _concepts_ are the same, though, and if you're familiar with the results of an index merge, some of the output will make more sense. You give it three trees: the base tree (tree 0), and the two branches to be merged (tree 1 and tree 2 respectively). It will then walk these three trees, and resolve them as it goes along. The interesting part is: - it can resolve whole sub-directories in one go, without actually even looking recursively at them. A whole subdirectory will resolve the same way as any individual files will (although that may need some modification, see later). - if it has a "content conflict", for subdirectories that means "try to do a recursive tree merge", while for non-subdirectories it's just a content conflict and we'll output the stage 1/2/3 information. - a successful merge will output a single stage 0 ("merged") entry, potentially for a whole subdirectory. - it outputs all the resolve information on stdout, so something like the recursive resolver can pretty easily parse it all. Now, the caveats: - we probably need to be more careful about subdirectory resolves. The trivial case (both branches have the exact same subdirectory) is a trivial resolve, but the other cases ("branch1 matches base, branch2 is different" probably can't be silently just resolved to the "branch2" subdirectory state, since it might involve renames into - or out of - that subdirectory) - we do not track the current index file at all, so this does not do ...
Here, btw, is the trivial diff to turn my previous "tree-resolve" into a
"resolve tree relative to the current branch".
In particular, it makes the example merge perhaps even more interesting,
and makes the "merging directories and merging files should use different
heuristics more obvious". It's quite instructive, I think.
So if you want to test this, the merge I have been testing with is the
last infiniband merge in the kernel:
git-merge-tree 3c3b809 4cbf876 7d2babc
and you'll need to spend a few moments on thinking about what the
"directory merge" thing there means: in particular, we should probably
make the
if (entry[2].sha1) {
test be
if (entry[2].sha && !S_ISDIR(entry[2].mode)) {
(and same for "resolve to entry[1]" case for that matter) so that we never
create a "resolve()" that picks a whole subdirectory from one of the
branches.
The current logic is "logical", just probably not what we want.
Linus
----
diff --git a/merge-tree.c b/merge-tree.c
index 0d6d434..0bf871c 100644
--- a/merge-tree.c
+++ b/merge-tree.c
@@ -55,9 +55,19 @@ static int same_entry(struct name_entry
a->mode == b->mode;
}
-static void resolve(const char *base, struct name_entry *result)
+static void resolve(const char *base, struct name_entry *branch1, struct name_entry *result)
{
- printf("0 %06o %s %s%s\n", result->mode, sha1_to_hex(result->sha1), base, result->path);
+ char branch1_sha1[50];
+
+ /* If it's already branch1, don't bother showing it */
+ if (!branch1)
+ return;
+ memcpy(branch1_sha1, sha1_to_hex(branch1->sha1), 41);
+
+ printf("0 %06o->%06o %s->%s %s%s\n",
+ branch1->mode, result->mode,
+ branch1_sha1, sha1_to_hex(result->sha1),
+ base, result->path);
}
static int unresolved_directory(const char *base, struct name_entry n[3])
@@ -183,21 +193,21 @@ static void merge_trees(struct tree_desc
/* Same in both? */
if (same_entry(entry+1, entry+2)) {
if (entry[0].sha1) {
- resolve(base, ...Gaah. It was trivial, and it happened to work fine for my test-case, but
when I started looking at not doing that extremely aggressive subdirectory
merging, that showed a few other issues...
So in case people want to try, here's a third patch. Oh, and it's against
my _original_ path, not incremental to the middle one (ie both patches two
and three are against patch #1, it's not a nice series).
Now I'm really done, and won't be sending out any more patches today.
Sorry for the noise.
Linus
----
diff --git a/merge-tree.c b/merge-tree.c
index 0d6d434..6381118 100644
--- a/merge-tree.c
+++ b/merge-tree.c
@@ -55,9 +55,26 @@ static int same_entry(struct name_entry
a->mode == b->mode;
}
-static void resolve(const char *base, struct name_entry *result)
+static const char *sha1_to_hex_zero(const unsigned char *sha1)
{
- printf("0 %06o %s %s%s\n", result->mode, sha1_to_hex(result->sha1), base, result->path);
+ if (sha1)
+ return sha1_to_hex(sha1);
+ return "0000000000000000000000000000000000000000";
+}
+
+static void resolve(const char *base, struct name_entry *branch1, struct name_entry *result)
+{
+ char branch1_sha1[50];
+
+ /* If it's already branch1, don't bother showing it */
+ if (!branch1)
+ return;
+ memcpy(branch1_sha1, sha1_to_hex_zero(branch1->sha1), 41);
+
+ printf("0 %06o->%06o %s->%s %s%s\n",
+ branch1->mode, result->mode,
+ branch1_sha1, sha1_to_hex_zero(result->sha1),
+ base, result->path);
}
static int unresolved_directory(const char *base, struct name_entry n[3])
@@ -100,9 +117,12 @@ static void unresolved(const char *base,
{
if (unresolved_directory(base, n))
return;
- printf("1 %06o %s %s%s\n", n[0].mode, sha1_to_hex(n[0].sha1), base, n[0].path);
- printf("2 %06o %s %s%s\n", n[1].mode, sha1_to_hex(n[1].sha1), base, n[1].path);
- printf("3 %06o %s %s%s\n", n[2].mode, sha1_to_hex(n[2].sha1), base, n[2].path);
+ if (n[0].sha1)
+ printf("1 %06o %s %s%s\n", n[0].mode, sha1_to_hex(n[0].sha1), base, ...Still true. I've just been thinking about the last state. As far as I can tell, the output from git-merge-tree with that fix to only simplify subdirectories that match exactly in all of base/branch1/branch2 is precisely the output that git-merge-recursive actually wants. Rather than doing a three-way merge with "git-read-tree", and then doing "git-ls-files --unmerged", I think this gives the same result much more efficiently. That said, I can't follow the python code, so maybe I'm missing something. Fredrik cc'd, in case he can put me right. Linus -
The matches the recollection I had last time I mucked with the
code. Currently it is set up to do one path at a time in both
index and working tree, so it would not be a trivial rewrite,
but merge-tree based approach would speed things up quite a
bit.
I was thinking about implementing mergers as a pipeline:
git-merge-tree O A B |
git-merge-renaming A |
git-merge-aggressive A |
git-merge-filemerge
git-merge-tree (yours) does not do trivial collapsing, and
produce raw-diff from A. git-merge-renaming reads it, finds
copied/renamed entries (maybe reusing parts of diffcore), and
writes out the results in the same format as merge-tree output
(that's why I am giving A on the command line -- so it can also
read A if it wanted to. it may need to talk about what a path in
A was even when merge-tree did not say anything about that
path). Then git-merge-aggressive (bad naming, I know, it only
corresponds to the flag of the same name in read-tree) will
collapse git-merge-one-file equivalent stage collapsing. The
remainder is fed to file-level merger for postprocessing.
Everything except the last step would work on a data format that
merge-tree outputs.
-
(It does _truly_ trivial collapsing, but I think we both agree: it doesn't I was considering perhaps doing a first cut at that in git-merge-tree already. Not sure. One issue is that I think I may have to change the output format if I do that. I should anyway. Why? It's hard to see where "one event" stops, and another starts. I stupidly initially thought that you can do it entirely based on looking at the numbers, but you can't. Right now you have to look at the pathname too, which is kind of sad, and doesn't work after rename detection (since then the pathnames won't be sorted any more, and one "event" can have different pathnames in different stages). [ Side note: it doesn't even work for file/directory conflicts, which can have the same name, but are two different "events". So you'd actually have to look at both mode _and_ filename to sort out if two lines that start with "1" and "3" respectively are one event (removal in first branch) or two events ("1" on one file: removal in both branches + "3" on another file: add in second branch) ] So to do the rename output, you can't use the same format as merge-tree uses right _now_. We'd have to add a marker to mark what the event boundaries are. The "mark" could be a running "event number", or even as easy as an alternating character ("#" vs " " as the second character in the line or similar) So instead of 2 100644 ff280e2e1613e808e4d7844376134dfa2bb1fc21 Documentation/cputopology.txt 2 100644 28c5b7d1eb90f0ccd8e0307c170f89bd7954dc9c Documentation/hwmon/f71805f 1 100644 b88953dfd58022aef1680c266c7438605b146fc8 Documentation/i2c/busses/i2c-sis69x 3 100644 b88953dfd58022aef1680c266c7438605b146fc8 Documentation/i2c/busses/i2c-sis69x 2 100644 00a009b977e92b1a942d1138afdccf1b725df956 Documentation/i2c/busses/i2c-sis96x 2 100644 90a5e9e5bef1daa9d0f0621e209827f0d180f384 Documentation/unshare.txt 2 100644 5127f39fa9bf9a384a6529c6d5deb1002e945de5 ...
Btw, some actual numbers: I did the recent kernel networking merge (which is a trivial in-index merge) with the standard three-way git-read-tree -m <base> <branch> <branch> and with the new git-merge-tree to compare performance. Doing git-read-tree takes ~0.35s, while git-merge-tree took 0.015s. Now, that's not a really fair comparison, because the end result is very different: the git-read-tree has populated the index, ready for a git-writet-ree, while the git-merge-tree has not. However, the interesting part is that especially for a trivial merge, we don't actually _want_ to necessarily populate the index, because doing a "git-write-tree" is actually a pretty expensive operation (on the kernel, it will try to write 1000+ directory trees, most of which already exist. Admittedly we don't actually have to write the objects, since we figure out that they already exist, but we have to do the SHA1 calculations to do so). So if we made the git-merge-tree based merge work entirely on trees all the way, and never even necessarily populate the index at all (unless it has to, due to actual data conflicts that want to be fixed up), that would actually be another performance advantage. The only downside there is that we would literally have to write the resulting tree objects by hand (ie we'd need a new helper for doing that, and another thing to validate). Anyway, that should almost certainly make it possible to scale up git merges to hundreds of thousands of files without huge performance problems (still, that depends a bit on layout - again, flat directory structures won't scale as well, so it might not be enough for maildir handling). But just at a guess, I think there's at least an order of magnitude to be had there. So if a maildir merge currently takes an hour, at least we should be able to get it down to a few minutes. Ben, are you interested in trying this out in your maildir experiments? Linus -
Btw, here's one last gasp on this thread: it generalizes the notion of traversing several trees in sync, which could be used to do the n-way diff for the "-c" and "--cc" style merge diffs a lot more efficiently. I didn't check, but I'm pretty sure that this would bring the cost of doing the 12-way diff down to way under a second. Right now: [torvalds@g5 linux]$ time git-diff-tree -c 9fdb62a > /dev/null real 0m1.279s user 0m1.272s sys 0m0.008s and that's a bit too much. We I'd really have expected us to be able to do better. It should be possible to do this as a traverse_trees(12, &trees, "", combined_diff_callback); fairly cheaply (and quickly throw away anything where any of the parents was the same as the result). Junio, that "traverse_trees()" logic is totally independent of whether we actually do "git-merge-tree" or not, so if you want to, I could split up the patches the other way (and merge "traverse_trees()" first as a new interface, independently). Linus ---- git-merge-tree: generalize the "traverse <n> trees in sync" functionality It's actually very useful for other things too. Notably, we could do the combined diff a lot more efficiently with this. Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff --git a/merge-tree.c b/merge-tree.c index 6381118..2a9a013 100644 --- a/merge-tree.c +++ b/merge-tree.c @@ -125,44 +125,19 @@ static void unresolved(const char *base, printf("3 %06o %s %s%s\n", n[2].mode, sha1_to_hex(n[2].sha1), base, n[2].path); } -/* - * Merge two trees together (t[1] and t[2]), using a common base (t[0]) - * as the origin. - * - * This walks the (sorted) trees in lock-step, checking every possible - * name. Note that directories automatically sort differently from other - * files (see "base_name_compare"), so you'll never see file/directory - * conflicts, because they won't ever compare the same. - * - * IOW, if a directory changes to a filename, it will automatically be - * ...
I won't have time to look at the actual patch tonight but I am interested. I think the general idea should work nice with both multi-base and octopus merge cases as well ;-). -
I don't think you miss anything. I _think_ (I haven't looked at this too close yet) that it shouldn't be too much work to make git-merge-recursive make use of the git-merge-tree thing. - Fredrik -
Hi, That is intentional: git handles source code very well, where you tend to have small files, and it handles branches very well, where you tend to have mostly the same files in different branches. I am uncertain if it is possible to extend git to handle large files gracefully, without slowing it down for its main use case. [thinking] A potentially silly idea just hit me: We could virtually cut every file into 256kB chunks. That would not affect source code at all: anybody producing a 256kB C file should be shot anyway. If the files just keep growing, this should help enormously. If the files change subtly, the diff algorithm should work quite well on 'em. Comments? Ciao, Dscho -
Indeed. The git architecture simply sucks for big objects. It was discussed somewhat durign the early stages, but a lot of it really is pretty fundamental. The fact that all the operations work on a full object, and the delta's are (on purpose) just a very specific and limited It probably wouldn't help that much, really. And it would probably impact source code users too: I bet we'd have bugs. It would be a very strange special case. It also would only help for things that purely grow at the end. Which isn't even true for a mailbox: it may or may not be true for your INBOX, but anybody who _uses_ a mailbox format to read his email will be adding status flags to the mbox format (or deleting mbox entries etc). So every time a small change happened that changed the offset, you'd have an explosion of these 256kB chunk objects, and while the delta would work (probably slowly - remember how the git deltification algorithm tries to compare against the ten "nearest" neighbors), at _commit_ time you'd have to write that 1GB (compressed) out anyway. Realistically, I think the answer is that git just doesn't work for his usage case. There's two alternatives: - convince him to not have big mailboxes (an answer I don't particularly like: it's a tool limitation, and you shouldn't change your behaviour just because the tool doesn't work for it - you should just try to find the right tool). That said: git should actually work beautifully for email if you _don't_ keep it as one big mbox. You could probably very reasonably use git as a database backend, where each email is its own object, and you can have many different ways of indexing them into trees (by content, by date, by author, by thread). But that's very different from the suggested "home directory" setup would be. - try to work around some of the worst git issues. While I don't think the 256kB blockign thing would help (the git protocol would still ...
Side note: the original explicit git "delta" objects by Nicolas Pitre would have handled this large-file-case much more gracefully. The pack-files had absolutely huge advantages, though, so I think we (I) did the right thing there in making the delta code only a very specific special case.. It is possible that we could re-introduce the "explicit delta" object, though (it's not incompatible with also doing pack-files, it's just that pack-files made 99% of all the arguments for an explicit delta go away). Linus -
I do not remember we had 'rev-list --objects' support for Nico's explicit delta object chains. If we didn't that would be a new development that needs to be done to resurrect it. I know pack-objects never had support for it so obviously that needs to be added as well. Probably explicit delta objects should always be packed in full without spending cost to find delta candidates. Personally I feel that post-1.2.0 would be a good time to start looking at enhancing the pack generation chain, rev-list piped to pack-objects. This "large files" use case is helped by less self-contained packs while "shallow clone" use case we discussed earlier is helped by more self-contained packs (we had a discussion long time ago on this and I think we have the code to do so [*1*]). An addition to pack-objects is needed to make it capable to read a list of objects that we do not want to include in the resulting pack but can be used as base objects for delitified. BTW, as to the "shallow clone", I changed my mind and am inclined to agree with Johannes that handling cut-offs differently from grafts is easier for dealing with later "give me more history" operation, so I am planning to chuck my jc/clone topic branch that I have included in the proposed updates so far. [Footnote] *1* http://article.gmane.org/gmane.comp.version-control.git/5779 -
