On the gcc repository (which is now a 234 meg pack for me), git annotate ChangeLog takes > 800 meg of memory (I stopped it at about 1.6 gig, since it started swapping my machine). I assume it will run out of memory. I stopped it after 2 minutes. Mercurial, on the same file, takes 50 meg and 30 seconds. git annotate fold-const.c takes 300 meg of memory and takes > 30 seconds. Mercurial, on the same file takes 50 meg of memory and 10 seconds. svn takes 15 seconds and 20 meg of memory. I have excluded the mmap memory from mmap'ing the pack/file (in git/mercurial respectively). Annotate is treasured by gcc developers (this was a key sticking point in svn conversion). Having an annotate that is 2x slower and takes 15x memory would not fly (regardless of how good the results are). This seems to be a common problem with git. It seems to use a lot of memory to perform common operations on the gcc repository (even though it is faster in some cases than hg). --Dan -
The thing is, git has a very different notion of "common operations" than you do. To git, "git annotate" is just about the *last* thing you ever want to do. It's not a common operation, it's a "last resort" operation. In git, the whole workflow is designed for "git log -p <pathnamepattern>" rather than annotate/blame. In fact, we didn't support annotate at all for the first year or so of git. The reason for git being relatively slow is exactly that git doesn't have "file history" at all, and only tracks full snapshots. So "git blame" is really a very complex operation that basically looks at the global history (because nothing else exists) and will basically generate a totally different "view" of local history from that one. The disadvantage is that it's much slower and much more costly than just having a local history view to begin with. However, the absolutely *huge* advantage is that it isn't then limited to local history. So where git shines is when you actually use the global history, and do merges or when you track more than one file (which others find hard, but git finds much more natural). An example of this is content that actually comes from multiple files. File-based systems simply cannot do this at all. They aren't just slower, they are totally unable to do it sanely. For git, it's all the same: it never really cares about file boundaries in the first place. The other example is doing things like "git log -p drivers/char", where you don't ask for the log of a single file, but a general file pattern, and get (still atomic!) commits as the result. And perhaps the best example is just tracking code when you have two files that merge into one (possibly because the "same" file was created independently in two different branches). git gets things like that right without even thinking about it. Others tend to just flounder about and can't do anything at all about it. That said, I'll see if I can speed up "git blame" on the ...
I understand this, and completely agree with you. However, I cannot force GCC people to adopt a completely new workflow in this regard. The changelogs are not useful enough (and we've had huge fights over this) to do git log -p and figure out the info we want. Looking through thousands of diffs to find the one that happened to your line is also pretty annoying. Annotate is a major use for gcc developers as a result. SVN had the same problem (the file retrieval was the most expensive op on FSFS). One of the things I did to speed it up tremendously was to do the annotate from newest to oldest (i.e. in reverse), and stop annotating when we had come up with annotate info for all the lines. If you can't speed up file retrieval itself, you can make it need fewer files :) In GCC history, it is likely you will be able to cut off at least 30% of the time if you do this, because files often have changed entirely multiple times. -
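A minimal sketch of the newest-to-oldest idea with early exit, in C. It assumes a deliberately simplified toy model in which every revision has the same number of lines and lines never move, so "diffing" is a per-position compare. The names `annotate_newest_first` and `NLINES` are made up for illustration; neither SVN nor git implements annotate this way.

```c
#include <assert.h>
#include <stddef.h>

#define NLINES 4	/* toy assumption: every revision has exactly 4 lines */

/*
 * revs[0] is the newest revision, revs[nrevs - 1] the oldest (the root).
 * blame[i] receives the index of the revision that introduced line i.
 * Returns how many revisions had to be retrieved before every line was
 * blamed; the walk stops as soon as no unblamed lines remain.
 */
static int annotate_newest_first(const int revs[][NLINES], int nrevs,
				 int blame[NLINES])
{
	int unblamed = NLINES;
	int fetched = 1;	/* the newest revision itself */
	int r, i;

	assert(nrevs > 0);
	for (i = 0; i < NLINES; i++)
		blame[i] = -1;

	for (r = 0; unblamed > 0; r++) {
		int is_root = (r == nrevs - 1);

		if (!is_root)
			fetched++;	/* retrieve the parent to compare against */
		for (i = 0; i < NLINES; i++) {
			if (blame[i] != -1)
				continue;	/* already settled by a newer revision */
			/* A line differing from the parent, or any line in the root, is blamed here. */
			if (is_root || revs[r][i] != revs[r + 1][i]) {
				blame[i] = r;
				unblamed--;
			}
		}
	}
	return fetched;
}
```

With five revisions where the file was rewritten entirely two steps back, the walk retrieves three revisions and never touches the two oldest ones, which is the saving described above.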
Unfortunately, we're doing that already. One improvement that is already available is that we can do progressive annotate: we can output lines we find in the order we find them, such that lines that changed recently (which are usually the more interesting ones) get annotated quicker. Obviously, you need a GUI-ish thing to do this, because pagers don't like having stuff written out of order, but there's a good chance that a user annotating fold-const.c will have the info for the interesting lines in a few seconds, and go on while git is still trying to find where the boring old lines came from. There's also the possibility of generating caches of commit:file pairs you've annotated, which would make generating the annotation for something you'd annotated for a recent commit blindingly fast. -Daniel *This .sig left intentionally blank* -
If the question you want to answer is "what happened to that line" then using git annotate is using a big hammer for no good reason. git log -S'<put the content of the line here>' -- path/to/file.c will give you the very same answer, pointing you to the changes that added or removed that line directly. It's not a fast command either, but it should be less resource hungry than annotate, which has to do roughly the same for all lines whereas you're interested in only one. The direct plus here is that git log output is incremental, so you get answers about the first diffs quite quickly, which lets you examine the first answers while the rest is still being computed. Unlike git annotate, this also allows you to restrict the revisions it searches to a range where you know this happened, which makes it almost instantaneous in most cases. Of course, if the line is ' free(p);\n' then you will probably have quite a few false positives, but with the path restriction, I assume this will still be quite accurate. What is important here is to know what real question the GCC programmers want to answer. It seems to me that `blame` is overkill for the underlying issue. Note that this does not justify the current memory consumption, which just looks bad and wrong to me, but this aims at finding a way to answer your question doing just what you need to answer it and not gazillions of other things :) -- ·O· Pierre Habouzit ··O madcoder@debian.org OOO http://www.madism.org
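To make the pickaxe behaviour concrete: -S reports a commit when the number of occurrences of the string differs between the old and the new version of a file. The C sketch below models only that rule on two in-memory strings. `count_occurrences` and `pickaxe_hit` are hypothetical names, and counting non-overlapping matches with strstr() is a simplifying assumption, not a description of git's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of needle in text. */
static size_t count_occurrences(const char *text, const char *needle)
{
	size_t n = 0, len = strlen(needle);

	assert(len > 0);	/* an empty needle would never advance */
	while ((text = strstr(text, needle)) != NULL) {
		n++;
		text += len;
	}
	return n;
}

/*
 * Simplified pickaxe test: a change is interesting when it alters how
 * many times the string appears, i.e. it added or removed an occurrence.
 */
static int pickaxe_hit(const char *before, const char *after,
		       const char *needle)
{
	return count_occurrences(before, needle) !=
	       count_occurrences(after, needle);
}
```

Under this rule a commit that merely moves the line within the same file is not reported, because the count stays the same, while a generic line such as free(p); can match in unrelated hunks, which is where the false positives mentioned above come from.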
Yes, but blame also takes revision bottoms (obviously you have to start digging from a single revision so "blame master..next pu" would not You can feed more than a line from -S, and the assumed and recommended Right. -
Oh, I agree. It's why we do have "git blame" these days, and it's why I've tried to make people use the nicer incremental mode, which is not at all faster, but it's a hell of a lot more pleasant to use because you get some output immediately. In other words, git blame gcc/ChangeLog is virtually useless because it's too expensive, but try doing git gui blame gcc/ChangeLog instead, and doesn't that just seem nicer? (*) The difference is that the GUI one does it incrementally, and doesn't have to get _all_ the results before it can start reporting blame. Not that I claim that the gui blame is perfect either (I dunno why it delays the nice coloring so long, for example), but it was something I pushed - and others made the gui for - exactly to help people with the We do that. The expense for git is that we don't do the revisions as a single file at all. We'll look through each commit, check whether the "gcc" directory changed, if it did, we'll go into it, and check whether the "ChangeLog" file changed - and if it did, we'll actually diff it Not gcc/ChangeLog, though (apart from the renames that happen occasionally). Btw, an example of something git *should* do right, but is just too damn expensive, is doing git gui blame gcc/ChangeLog-2000 and have it actually be able to track the original source of each of those annotations across that "ChangeLog split from hell". I bet it would eventually get it right, but that's a large file, way back in history, and it will try to do a non-whitespace blame with copy detection. That's *expensive*, although it is an amusing thing to try to do ;) Linus PS. I also do agree that we seem to use an excessive amount of memory there. As to whether it's the same issue or not, I'd not go as far as Nico and say "yes" yet. But it's interesting. It's not entirely surprising that we see multiple issues with the gcc repo, simply because it's not the kind of repo that people have ever really ...
And, btw: the diff is totally different from the xdelta we have, so even if we have an already prepared nice xdelta between the two versions, we'll end up re-generating the files in full, and then do a diff on the end result. Of course, part of that is that git logically *never* works with deltas, except in the actual code-paths that generate objects (or generate packs, of course). So even if we had used a delta algorithm that would be amenable to be turned into a diff directly, it would have been a layering violation to actually do that. Other systems can sometimes just re-use their deltas to generate the diffs and/or blame information. I dunno whether SVN does that. CVS does, afaik. Linus -
CVS does because its delta is line based, so it's easy. You can theoretically generate blame info from SVN/GIT's block deltas, but you of course have the problem GIT does, which is that the delta is not meant to represent the actual changes that occurred, but instead the smallest way to reconstruct data x from data y. This only sometimes has any relation to how the file actually changed. -
Exactly. Git objects in themselves have no history or relationships, and being a delta against another object means nothing at all except for the fact that the data seems to resemble that other object (which has a _correlation_ with being related, but nothing more). Anyway, I think the git annotate memory usage was simply a real bug that nobody had noticed before because the memory leak wasn't all that noticeable with smaller files and/or less deep histories. Can you verify that it works for you with the patch I sent out? With that fix, I could even run git blame -C gcc/ChangeLog-2000 to see the blame machinery work past the strange "combine many different changelogs into year-based ones" commit. Now, I cannot honestly claim that it was really *usable* (it did take three minutes to run!), but sometimes those three minutes of CPU time may be worth it, if it shows the real historical context it came from. In the case of the ChangeLog-2000 file, all the original lines obviously came from older versions of a file called "gcc/ChangeLog", so the end result doesn't really show what an involved situation it was to track the sources back through not just renames, but actually file splits and merges. Sad, but once you know what it did it's still a bit cool to see that it worked ;) Linus -
That doesn't mean we can't opportunistically jump layers when available, and fall back on the regular behavior otherwise. The nice thing about clean and simple layers is that you can always add optimizations later by poking sane holes.
Let's assume for the sake of argument that we can convert an xdelta into a diff fairly cheaply. Using the patch below, we can count the places where we are diffing two blobs, and one blob is a delta base of the other (assuming our magical conversion function can also reverse diffs. ;) ). For a "git log -p" on git.git, I get:
9951 diffs could be optimized
10958 diffs could not be optimized
or about 48%. It would be nice if we could drop the cost by almost 50% (if our magical function is free to call, too!).
Of course, I haven't even looked at whether converting xdeltas to unified diffs is possible. I suspect in some cases it is (e.g., pure addition of text) and in some cases it isn't (I assume xdelta doesn't have any context lines, which might hurt). And it's possible that a specialized diff user like git-blame can just learn to use the xdeltas by itself (I didn't get a "could optimize" count for git-blame since it seems to follow a different codepath for its diffs).
---
diff --git a/cache.h b/cache.h
index 27d90fe..0d672be 100644
--- a/cache.h
+++ b/cache.h
@@ -569,6 +569,7 @@ extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsign
extern unsigned long unpack_object_header_gently(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
extern const char *packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
+extern int have_xdelta(unsigned char from[20], unsigned char to[20]);
extern int matches_pack_name(struct packed_git *p, const char *name);

/* Dumb servers support */
diff --git a/diff.c ...
I think the answer here is that git-annotate is a totally different issue.
The blame machinery keeps around all the blobs it has ever needed to do a
diff, which explains why something like gcc/ChangeLog blows up badly.
Try this trivial patch.
It will cause us to potentially re-generate some blobs much more, but
that's a reasonably cheap operation, and our delta base cache will get the
expensive cases.
It's still not a free operation, but I get
[torvalds@woody gcc]$ /usr/bin/time ~/git/git-blame gcc/ChangeLog > /dev/null
20.68user 1.25system 0:21.94elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+599833minor)pagefaults 0swaps
so it took 22s and I never saw it grow very large either (it grew to 72M
resident, but I don't know how much of that was the mmap of the
pack-file, so that number is pretty meaningless). Valgrind reports that
it used a maximum heap of about 24M, and almost all of that seems to have
been in the delta cache (which is all good).
Linus
----
builtin-blame.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/builtin-blame.c b/builtin-blame.c
index c158d31..18f9924 100644
--- a/builtin-blame.c
+++ b/builtin-blame.c
@@ -87,6 +87,14 @@ struct origin {
char path[FLEX_ARRAY];
};
+static void drop_origin_blob(struct origin *o)
+{
+ if (o->file.ptr) {
+ free(o->file.ptr);
+ o->file.ptr = NULL;
+ }
+}
+
/*
* Given an origin, prepare mmfile_t structure to be used by the
* diff machinery
@@ -558,6 +566,8 @@ static struct patch *get_patch(struct origin *parent, struct origin *origin)
if (!file_p.ptr || !file_o.ptr)
return NULL;
patch = compare_buffer(&file_p, &file_o, 0);
+ drop_origin_blob(parent);
+ drop_origin_blob(origin);
num_get_patch++;
return patch;
}
-
While this should be safe (because the user of blob lazily re-fetches),
it feels a bit too aggressive, especially when -C or other "retry and
try harder to assign blame elsewhere" option is used.
Instead, how about discarding after we are done with each origin, like
this?
---
builtin-blame.c | 17 +++++++++++++++--
1 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/builtin-blame.c b/builtin-blame.c
index c158d31..eda79d0 100644
--- a/builtin-blame.c
+++ b/builtin-blame.c
@@ -130,6 +130,14 @@ static void origin_decref(struct origin *o)
}
}
+static void drop_origin_blob(struct origin *o)
+{
+ if (o->file.ptr) {
+ free(o->file.ptr);
+ o->file.ptr = NULL;
+ }
+}
+
/*
* Each group of lines is described by a blame_entry; it can be split
* as we pass blame to the parents. They form a linked list in the
@@ -1274,8 +1282,13 @@ static void pass_blame(struct scoreboard *sb, struct origin *origin, int opt)
}
finish:
- for (i = 0; i < MAXPARENT; i++)
- origin_decref(parent_origin[i]);
+ for (i = 0; i < MAXPARENT; i++) {
+ if (parent_origin[i]) {
+ drop_origin_blob(parent_origin[i]);
+ origin_decref(parent_origin[i]);
+ }
+ }
+ drop_origin_blob(origin);
}
/*
-
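Both patches rely on the blob being re-fetched lazily after it has been dropped. A minimal C sketch of that pattern follows, assuming a hypothetical `blob_slot` whose `source` string stands in for the object store; `fill_blob` and `drop_blob` are illustrative names, not the functions in builtin-blame.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cache slot; "source" stands in for the object store. */
struct blob_slot {
	const char *source;
	char *ptr;	/* cached contents, or NULL when dropped */
	size_t size;
	int loads;	/* how often the contents had to be regenerated */
};

/* Return the contents, regenerating them only if they are not cached. */
static const char *fill_blob(struct blob_slot *s)
{
	assert(s && s->source);
	if (!s->ptr) {
		s->size = strlen(s->source);
		s->ptr = malloc(s->size + 1);
		if (!s->ptr)
			return NULL;
		memcpy(s->ptr, s->source, s->size + 1);
		s->loads++;
	}
	return s->ptr;
}

/* Release the cached copy; the next fill_blob() pays for a reload. */
static void drop_blob(struct blob_slot *s)
{
	free(s->ptr);
	s->ptr = NULL;
}
```

The trade-off is the one discussed above: dropping right after every diff keeps memory flat but can regenerate the same blob repeatedly, while dropping once per origin regenerates less often at the cost of holding a few blobs a little longer.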
Sure, looks fine to me. With either of these patches, all of the cost is in the diffing routines:
samples  %        image name   app name     symbol name
191317   31.4074  git          git          xdl_hash_record
120060   19.7096  git          git          xdl_recmatch
99286    16.2992  git          git          xdl_prepare_ctx
56370     9.2539  libc-2.7.so  libc-2.7.so  memcpy
23315     3.8275  git          git          xdl_prepare_env
.. and while I suspect xdiff could be optimized a bit more for the cases where we have no changes at the end, that's beyond my skills.
Linus
-
Ok, I lied.
Nothing is beyond my skills. My mad k0der skillz are unbeatable.
This speeds up git-blame on ChangeLog-style files by a big amount, by just
ignoring the common end that we don't care about, since we don't want any
context anyway at that point. So I now get:
[torvalds@woody gcc]$ time git blame gcc/ChangeLog > /dev/null
real 0m7.031s
user 0m6.852s
sys 0m0.180s
which seems quite reasonable, and is about three times faster than trying
to diff those big files.
Davide: this really _does_ make a huge difference. Maybe xdiff itself
should do this optimization on its own, rather than have the caller hack
around the fact that xdiff doesn't handle this common case all that well?
The same thing obviously works for the beginning-of-file too, but then you
have to play games with line numbers being affected etc, so the end is the
rather much easier case and is the case that a ChangeLog-style file cares
about.
Daniel, this is obviously on top of the patches that fix the memory leak.
Linus
---
diff --git a/builtin-blame.c b/builtin-blame.c
index c158d31..677188c 100644
--- a/builtin-blame.c
+++ b/builtin-blame.c
@@ -543,6 +551,20 @@ static struct patch *compare_buffer(mmfile_t *file_p, mmfile_t *file_o,
return state.ret;
}
+#define BLOCK 1024
+
+static void truncate_common_data(mmfile_t *a, mmfile_t *b)
+{
+ long l1 = a->size, l2 = b->size;
+
+ while ((l1 -= BLOCK) > 0 && (l2 -= BLOCK) > 0) {
+ if (memcmp(a->ptr + l1, b->ptr + l2, BLOCK))
+ break;
+ a->size = l1;
+ b->size = l2;
+ }
+}
+
/*
* Run diff between two origins and grab the patch output, so that
* we can pass blame for lines origin is currently suspected for
@@ -557,6 +579,7 @@ static struct patch *get_patch(struct origin *parent, struct origin *origin)
fill_origin_blob(origin, &file_o);
if (!file_p.ptr || !file_o.ptr)
return NULL;
+ truncate_common_data(&file_p, &file_o);
patch = compare_buffer(&file_p, &file_o, 0);
...
I didn't follow the thread, but I can guess from the subject that this is about memory, isn't it? Libxdiff already has a xdl_trim_ends() that strips all the common beginning and ending records, but at that point files are already loaded. Since libxdiff works with memory files in order to keep any sort of system dependency out of the window, the optimization would be useless on the libxdiff side. This is because the user would already have to have the file loaded in memory in order to pass it to libxdiff. If this is really about memory, this had better be kept on the libxdiff caller side, so that it can avoid loading the terminal file sections altogether. About your code, you may want to have an extend-till-next-eol code after the trimming part, since the last line may be used for context in the diffs. - Davide -
That's not the problem. The problem with xdl_trim_ends() is that it happens *after* you have done all the hashing, so as an optimization it's fairly useless, because it still leaves the real cost (the per-line hashing) on the table. So doing the trimming of the ends before you do even that, allows you to just do the trivial "let's see if the ends are identical" with a plain memcmp, which is much faster. Linus -
Careful. The real cost of diffing, is not the O(1) pass of the prepare phase. It's the potentially O(N*M) worst case of the cross-record compare. So that optimization is far from useless. That optimization is indeed Yes, tail trimming done on a block-basis is faster and does not consume memory. The code for libxdiff would have to be a bit more complex though, since memory files can be composed by many sections, of different sizes (so you cannot just assume it's a single block you're trimming the end). Also, you'd need some code at the end that hands you back at least the N lines you want for context. - Davide -
I'm not saying it's useless. I'm saying it's ineffective.
My simple patch that you saw, speeded up a real-life case by A FACTOR OF
Sure. The special case I added it to specifically wanted a context of zero
in the caller, so I could just ignore that.
But doing this in general and handing back the context is a simple matter
of
while (size < orig && context_lines) {
if (src->buffer[size++] == '\n')
context_lines--;
}
which will usually hit in a really short time (ie three lines by default,
just a few tens of bytes).
Linus
-
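The two pieces discussed in this subthread, block-wise tail trimming with memcmp() and handing back enough of the tail for context, can be sketched together in C. `struct membuf`, `trim_common_tail` and `restore_context` are hypothetical names; the struct is only a stand-in for xdiff's mmfile_t, and the sketch assumes a single contiguous buffer, which Davide notes is not guaranteed for libxdiff memory files.

```c
#include <assert.h>
#include <string.h>

#define TRIM_BLOCK 1024	/* same value as BLOCK in the patch above */

/* Hypothetical stand-in for xdiff's mmfile_t: a pointer and a length. */
struct membuf {
	const char *ptr;
	long size;
};

/*
 * Drop identical TRIM_BLOCK-sized chunks from the end of both buffers
 * using plain memcmp(), so the shared tail never reaches the per-line
 * hashing. Returns how many bytes were cut from each buffer.
 */
static long trim_common_tail(struct membuf *a, struct membuf *b)
{
	long l1 = a->size, l2 = b->size, cut = 0;

	assert(l1 >= 0 && l2 >= 0);
	while ((l1 -= TRIM_BLOCK) > 0 && (l2 -= TRIM_BLOCK) > 0) {
		if (memcmp(a->ptr + l1, b->ptr + l2, TRIM_BLOCK))
			break;
		a->size = l1;
		b->size = l2;
		cut += TRIM_BLOCK;
	}
	return cut;
}

/*
 * Give back bytes of the trimmed tail until context_lines newlines are
 * included again or the original end is reached. A line that the trim
 * cut in half is completed first and counts as one of those lines.
 */
static void restore_context(struct membuf *m, long orig_size,
			    int context_lines)
{
	while (m->size < orig_size && context_lines > 0) {
		if (m->ptr[m->size++] == '\n')
			context_lines--;
	}
}
```

A caller that wants zero context, like the blame case, would skip restore_context() entirely; a general diff caller would run it on both buffers with the same line count so the tails stay aligned.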
Sorry, I _did_ call it "fairly useless". The rest of the comment stands. I'm sure the trimming that xdiff does is good at avoiding some common O(n*m) cases, it's just not as good as it could be, and leaves a big constant factor of the O(n) case on the table. Linus -
Funny. I did not understand what you were talking about "no changes at
the end" when I read it ('cause I am at work and do not have the data
you are looking at handy), but now I see what you meant. It is a cute
hack that optimizes for a very special case of "prepend only" files (aka
"ChangeLog").
I suspect that this optimization has an interesting corner case, though.
What happens if you chomp at the middle of the last line that is
different between the two files? xdiff will report the line number but
wouldn't its (now artificial) "No newline at the end of the file" affect
the blame logic?
Besides, "prepend only" (or "append only") files would be good
candidates for the original -S"pickaxe" search, I would imagine, and
unless you are looking at that ChangeLog-2000 consolidated log, isn't
blame way overkill?
-
It shouldn't. I thought about it, but there doesn't seem to be any reason why blame could possibly care - the message can come at the end of a _real_ file, of course, so if the extra message confuses the blame logic, there's already a bug there. Actually, I suspect that this makes a difference for totally normal files too. I bet it cuts the size of the files to be tested for the common case (ie just a few small changes) down by 30-50% even on average. The fact that it cuts it down by 99.9% on ChangeLog files is just an added bonus. As Davide mentioned, xdiff actually does something like that hack for the beginning and end of files internally _anyway_, the problem with that is that it does it so late that it's already done a fairly expensive hash for the file (and allocated space for it based on guesses that are in turn based on the original size) that it doesn't actually get the full effect of the optimization. Linus -
This *should* trigger the special case:
	mkdir test-dir
	cd test-dir
	git init
	(echo -n a ; yes '' | dd count=2) > file
	git add file
	git commit -m "'a' + 1k newlines"
	(echo -n b ; yes '' | dd count=2) > file
	git add file
	git commit -m "'b' + 1k newlines"
and it all seems to work fine. But I didn't actually check that it really triggered, this is just creating a 1025-byte file that has a single character and then 1024 newlines. So when the logic removes the shared tail (all the newlines), it leaves a single-character newlineless buffer for diff, and no, git-blame didn't care, and got the right answer. Linus -
Thanks, these patches work *great*. I'm starting to have a few users who have no experience with git or hg try their daily workflow with it, to see what UI issues they come up with :) -
It's been a while for me to look at the blame engine, and it hit me that it would be interesting to run assign_blame() loop on multi-core machine in parallel threads. -
I'm not surprised at all. We had a number of issues with SVN that needed to be resolved. I'm basically trying to get issues worked (both on git and mercurial) out to the point where it is fair for our users to try their branch and trunk workflows with git and mercurial, and see which they like -
Linus Torvalds <torvalds@linux-foundation.org> wrote: git-gui waits to color until after it gets the move/copy annotations back from the -C -C -w second pass it does. This way the coloring is based on the original source location, not on the move/copy that caused it to be placed where it is now. I played around with this for a while and finally made it work the way it does as I assumed most users would want to see where something originally came from more than how it got moved to where it is now. IOW the (very expensive) -C -C -w pass is usually much more interesting than the default (fast) pass, so that is the line annotation data we color with. But it takes longer to get and is run second, so yea, coloring takes a while. -- Shawn. -
This could be useful for a command line tool, but for a GUI the top-down approach is a myth IMHO. In the GUI case what you actually end up doing (because a GUI allows it) is to start from the latest file version and check the code region you are interested in; then, when you find the changed lines, you _may_ want to double click and go to see how the file was before that change, and then perhaps start a new dig. I found this is my typical workflow with annotation info because I'm more interested not in what lines have changed but in _why_ they have changed, and to do this you naturally end up digging in the past (and also checking the corresponding revision's patch, for example in another tab). In this case the advantage of an oldest-to-newest annotation algorithm is that you have _already_ annotated all the history, so you can walk and dig back and forth among the different file versions without *any* additional delay. Marco -
My use of "git blame" is perhaps not typical, but I use it fairly often when I'm looking at a part of my company's code base that I'm not terribly familiar with. I've found it's the fastest way to figure out who to go ask about a particular block of code that I think is responsible for a bug, or more commonly, who to ask to review a change I'm making. "git log" is too coarse-grained to be useful for that purpose; it usually doesn't tell me which of the 500 revisions to the file I'm looking at introduced the actual line of code I want to change. To me that really has nothing whatsoever to do with git workflow or svn workflow; it happens well before I'm ready to do any kind of integration or commit or even, sometimes, before I've made any changes to any code at all. Given infinite spare time, one of the things I'd be strongly tempted to try to build would be some kind of blame cache. You could theoretically make blame pretty much instantaneous by doing something as simple as caching the per-line revision ID for each file in each revision in a shadow repository (or a shadow branch in the main repo) and keeping a map between shadow-repo revisions and real-repo ones. If the cache was of the form "one SHA1 hash per line in the original file" it would delta-compress pretty well. It'd be easy to update incrementally since you only need to walk back in history until you get to the most recently cached revision for each file, at which point you use the cached value for all the lines that haven't changed. Yeah, I know, code talks louder than words... -Steve -
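A rough C sketch of the incremental update step of such a cache, under a toy assumption that the new revision has the same number of lines as the cached one and that a `changed` flag per line is already known from a diff. `line_blame` and `refresh_blame_cache` are hypothetical names, and nothing like this exists in git as far as this thread says.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ID_LEN 20	/* a SHA-1 is 20 raw bytes */

/* One cached revision id per line of the file. */
struct line_blame {
	unsigned char id[ID_LEN];
};

/*
 * Bring a cached blame forward by one revision: lines flagged in
 * changed[] are blamed on new_id, all other lines keep the cached id.
 * Returns the number of lines that were re-blamed.
 */
static size_t refresh_blame_cache(struct line_blame *cache,
				  const unsigned char *changed,
				  size_t nlines,
				  const unsigned char new_id[ID_LEN])
{
	size_t i, updated = 0;

	assert(cache && changed);
	for (i = 0; i < nlines; i++) {
		if (changed[i]) {
			memcpy(cache[i].id, new_id, ID_LEN);
			updated++;
		}
	}
	return updated;
}
```

Because consecutive revisions of the cache differ only in the re-blamed lines, storing one such array per file revision is the kind of data that should delta-compress well, as suggested above.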
There is always "pickaxe" search, i.e. $ git log -p -S'<string>' -- <file or pathspec> which can be used instead of blame (perhaps with --follow). And you can limit blame to the interesting region of file, and to interesting (important) range of revisions. [about blame cache] "git gui blame" uses incremental blame; if only it accepted range (file fragment) limiting, and if "reblame" (blame --reference=<rev>, blaming incrementally only lines which changed wrt. given revision) was implemented. BTW. qgit actually does blame using its own "multiple files bottom-up blame" code (it would be nice to have it in core-git if possible, hint, hint), and does some caching, although I'm not sure whether it caches blame info too. You should try it, I think. -- Jakub Narebski Poland ShadeHawk on #git -
It has no excuse for eating up to 1.6GB of RAM though. That's plainly wrong. Nicolas -
git blame gcc/ChangeLog
It needs 2.25GB of RAM to run without swapping. That is pretty close to the same number the repack needs. -- Jon Smirl jonsmirl@gmail.com -
I've seen you pointing this kind of examples many times, but is that really different from what even SVN does? "svn log drivers/char" will also list atomic commits, and give me a filtered view of the global log. So, yes, that's cool, but I don't see a real difference between git and almost anything else (except CVS which really got this wrong, no big surprise). -- Matthieu -
Ok, BK and CVS both got this horribly wrong, which is why I care. Maybe this is one of the things SVN gets right. I seriously doubt it, though. Do you get *history* right, or do you just get a random list of commits? Of course, to see the difference, you need to do "gitk drivers/char" or use another of the log viewers that actually show you history too. A plain "git log" won't make it obvious (unless you actually ask for parent information and then just track the history in your head, in which case you don't really need an SCM in the first place ;) Linus -
No, it will get actual history (IE not just things that happen to have that path in the repository) -
OTOH svn has the result right, but the way it does that is horrible. When you svn log some/path, I think it just (basically) asks svn log for each file in that directory, and merges the logs together. This is "easy" for svn since it remembers "where this specific file" came from. So for svn it's just a matter of merging the individual files' histories together. It may have a more clever implementation, but basically I believe it would be similar to that in the end. Of course, if you do something as stupid as:
	svn cp Makefile some/path/foo.c
	# completely rewrite foo.c
	svn commit
then you'll have the history of `Makefile` melded into the some/path/foo.c svn log, which is completely horribly wrong. Or if you do (which unlike the previous example isn't silly for so many good reasons):
	cp bar.c foo.c
	svn add foo.c
	svn commit
then foo.c won't have bar.c history in its svn log.
-- ·O· Pierre Habouzit ··O madcoder@debian.org OOO http://www.madism.org
What? We version directories too. We don't do svn log for each file in the directory when you request a path. We look at the history of the path, follow renames, etc. When you change foo/bar/fred.c, we consider it a change to foo/bar and foo/, and thus, they have new versions. I'm not sure where you get this crazy notion that we do anything with files when you ask about directories. -
Well, you don't get merge commits right with SVN, but that's a different issue (svn 1.5 is supposed to have something about merge history, I don't know how it's done ...). So, if by "history" you mean how branches interfered with each other, obviously SVN is bad at this. But it's equally bad at "svn log dir/" and plain "svn log". But to simplify, if you take a linear history (no merge commits), "svn log dir/" gives you the list of commits which changed something inside "dir/". As pointed out in other messages, the way it's done is really different from what git does. SVN does know a lot about directories, and records a lot about them at commit time, while git just considers them as file containers. Yeah, CVS got this terribly wrong. IIRC, it just took the log for individual messages, and mixed them together, so a commit touching multiple files would appear several times. I've taken SVN as an extreme example, but at least bzr and mercurial have an approach very similar to git. So, to me, this particular point is something git obviously got right, but not a point where git is so different from the others. -- Matthieu -
Yeah, git just has higher goals. The time history really matters (or rather, what I call the "shape" of history) is when you are trying to merge, and you get a merge conflict. That's when you want to do gitk master merge ^merge-base -- files-that-are-unmerged and in fact this is such an important thing for me that there is a shorthand argument to do exactly that, ie: gitk --merge which shows the commits that touched the unmerged files graphically *with* the history being correct (ie you don't just get a random log of "these changes happened", you get the real history of the two branches as it Sure, linear history is trivial. But it's also almost totally uninteresting. Linus -
And I bet this is the exact same issue as the repack one. Do you still have the 2.1GB pack around? I bet annotate would eat much less memory in that case. Nicolas -
I do not, but i could remake it in a few days if it would help -
Well, depending on the amount of RAM in your machine, you might even not be able to remake it at the moment. I currently can't reproduce it myself due to the same out-of-memory issue. Nicolas -
Speed of annotation is mainly due to getting the file history more than calculating the actual annotation. I don't know *how* file history is stored in the other SCMs; perhaps it is easier to retrieve, i.e. without a full walk across the revisions... In case you have qgit (especially the 2.0 version, which is much faster in this feature) I would be very interested to have annotation times on this file. Indeed annotation times are shown split between file history retrieval, based on something along the lines of "git log -p -- <path>", and actual annotation calculation (fully internal to qgit). I would be interested in cold start and warm cache start (close the annotation tab and start annotation again). Thanks (a lot) Marco -
It is stored in an easier format. However, can you not simply provide side-indexes to do the annotation? I guess that won't work in git because you can change history (in other SCMs, history is readonly so you could know the results for I will try to do this. -
I don't know how other scms work, but history is definitely readonly in git - whatever sha1 you have that describes a commit was calculated based on its ancestor commits. If you have a commit's id, it will *always* refer to the same thing - a tree state and its complete ancestry. -
As Linus pointed out, annotation in git is "much slower and much more costly than just having a local history view to begin with". Indeed, to annotate say kernel/sched.c, the time spent by git while executing git log -p -- kernel/sched.c can be 10X higher than the subsequent annotation processing time starting from the git log output. Unfortunately my knowledge of git internals falls far, far short of guessing what could be done to speed up the *one file* history case. Thanks. Very appreciated. -
History in git is read-only. It's just that git lets you fork and move forward with something different. Each commit can never change (and, in fact, you'd have to badly break SHA1 to change it), but which commits are relevant to the history can change. Keeping extra information is fine; at worst, it'll go irrelevant. -Daniel *This .sig left intentionally blank* -
Well, revisions never change, but history, understood as a revision's parent information, can and does change when you use a path delimiter. So does the graph, which is a direct visualization of parent information. For a single revision (that modifies say 3 files) you can have at least 3 different histories, and actually more if you also want to visualize the history of the directory trees that own the modified files. You end up with a quite big number of different histories all showing your revisions in different ways, according to the path delimiter you use. Perhaps the intended meaning of "changing histories" is this, and in any case this is the reason you cannot (or it makes no sense to) "save" a single file history in git. Marco -
A less unwieldy repository that shows the same problem is: svn://svn.debian.org/secure-testing/ It's annotating the data/CVE/list file that uses tons of memory. I guess you don't need to clone the full history to exhibit the problem. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99 -
