Re: git annotate runs out of memory

From: Daniel Berlin
Date: Tuesday, December 11, 2007 - 10:33 am

On the gcc repository (which is now a 234 meg pack for me), git
annotate ChangeLog takes > 800 meg of memory (I stopped it at about
1.6 gig, since it started swapping my machine).
I assume it will run out of memory.  I stopped it after 2 minutes.

Mercurial, on the same file, takes 50 meg and 30 seconds.


git annotate fold-const.c takes 300 meg of memory and takes > 30 seconds.
Mercurial, on the same file takes 50 meg of memory and 10 seconds.
svn takes 15 seconds and 20 meg of memory.

I have excluded the mmap memory from mmap'ing the pack/file (in
git/mercurial respectively).

Annotate is treasured by gcc developers (this was a key sticking point
in svn conversion).
Having an annotate that is 2x slower and takes 15x memory would not
fly (regardless of how good the results are).

This seems to be a common problem with git. It seems to use a lot of
memory to perform common operations on the gcc repository (even though
it is faster in some cases than hg).

--Dan
-

From: Linus Torvalds
Date: Tuesday, December 11, 2007 - 11:40 am

The thing is, git has a very different notion of "common operations" than 
you do.

To git, "git annotate" is just about the *last* thing you ever want to do. 
It's not a common operation, it's a "last resort" operation. In git, the 
whole workflow is designed for "git log -p <pathnamepattern>" rather than 
annotate/blame.

In fact, we didn't support annotate at all for the first year or so of 
git.

The reason for git being relatively slow is exactly that git doesn't have 
"file history" at all, and only tracks full snapshots. So "git blame" is 
really a very complex operation that basically looks at the global history 
(because nothing else exists) and will basically generate a totally 
different "view" of local history from that one.

The disadvantage is that it's much slower and much more costly than just 
having a local history view to begin with.

However, the absolutely *huge* advantage is that it isn't then limited to 
local history.

So where git shines is when you actually use the global history, and do 
merges or when you track more than one file (which others find hard, but 
git finds much more natural).

An example of this is content that actually comes from multiple files. 
File-based systems simply cannot do this at all. They aren't just slower, 
they are totally unable to do it sanely. For git, it's all the same: it 
never really cares about file boundaries in the first place.

The other example is doing things like "git log -p drivers/char", where 
you don't ask for the log of a single file, but a general file pattern, 
and get (still atomic!) commits as the result.

And perhaps the best example is just tracking code when you have two files 
that merge into one (possibly because the "same" file was created 
independently in two different branches). git gets things like that right 
without even thinking about it. Others tend to just flounder about and 
can't do anything at all about it.

That said, I'll see if I can speed up "git blame" on the ...
From: Daniel Berlin
Date: Tuesday, December 11, 2007 - 12:09 pm

I understand this, and completely agree with you.
However, I cannot force GCC people to adopt a completely new workflow in
this regard.
The changelogs are not useful enough (and we've had huge fights over
this) to do git log -p and figure out the info we want.
Looking through thousands of diffs to find the one that happened to
your line is also pretty annoying.
Annotate is a major use for gcc developers as a result.

SVN had the same problem (the file retrieval was the most expensive op
on FSFS). One of the things I did to speed it up tremendously was to
do the annotate from newest to oldest (i.e., in reverse), and stop
annotating when we had come up with annotate info for all the lines.
If you can't speed up file retrieval itself, you can make it need fewer
files :)
In GCC history, it is likely you will be able to cut off at least 30%
of the time if you do this, because files often have changed entirely
multiple times.
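
A minimal sketch of the newest-to-oldest idea, under a deliberately simplified model that is not SVN's or git's actual code: each revision is assumed to record, for every line, the index of the same line in its parent (or -1 if the revision introduced it). The names `struct rev`, `from_parent` and `annotate_reverse` are hypothetical.

```c
#include <stddef.h>

#define MAX_LINES 64

/*
 * Hypothetical model: revision r holds nlines lines, and from_parent[i]
 * is the index of line i in revision r - 1, or -1 if r introduced it.
 */
struct rev {
	int nlines;
	int from_parent[MAX_LINES];
};

/*
 * Blame the lines of revs[newest], walking towards revs[0].
 * blame[i] receives the index of the revision that introduced line i.
 * Returns the number of revisions that had to be visited.
 */
int annotate_reverse(const struct rev *revs, int newest, int *blame)
{
	int cur[MAX_LINES];	/* position of each final line in the current rev, -1 once blamed */
	int total = revs[newest].nlines;
	int pending = total;
	int visited = 0;
	int r, i;

	for (i = 0; i < total; i++)
		cur[i] = i;

	/* Stop as soon as every line has an owner. */
	for (r = newest; r >= 0 && pending > 0; r--) {
		visited++;
		for (i = 0; i < total; i++) {
			int p;

			if (cur[i] < 0)
				continue;
			/* The root revision introduces everything still pending. */
			p = (r == 0) ? -1 : revs[r].from_parent[cur[i]];
			if (p < 0) {
				blame[i] = r;
				cur[i] = -1;
				pending--;
			} else {
				cur[i] = p;
			}
		}
	}
	return visited;
}
```

When a file was rewritten entirely at some revision, the loop ends there and older revisions are never fetched, which is where the saving comes from.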
-

From: Daniel Barkalow
Date: Tuesday, December 11, 2007 - 12:26 pm

Unfortunately, we're doing that already. One improvement that is already 
available is that we can do progressive annotate: we can output lines we 
find in the order we find them, such that lines that changed recently 
(which are usually the more interesting ones) get annotated quicker. 
Obviously, you need a GUI-ish thing to do this, because pagers don't like 
having stuff written out of order, but there's a good chance that a user 
annotating fold-const.c will have the info for the interesting lines in a 
few seconds, and go on while git is still trying to find where the boring 
old lines came from.

There's also the possibility of generating caches of commit:file pairs 
you've annotated, which would make generating the annotation for something 
you'd annotated for a recent commit blindingly fast.

	-Daniel
*This .sig left intentionally blank*
-

From: Pierre Habouzit
Date: Tuesday, December 11, 2007 - 12:34 pm

If the question you want to answer is "what happened to that line"
then using git annotate is using a big hammer for no good reason.

git log -S'<put the content of the line here>' -- path/to/file.c

will give you the very same answer, pointing you to the changes that
added or removed that line directly. It's not a fast command either, but
it should be less resource hungry than annotate that has to do roughly
the same for all lines whereas you're interested in one only.

The direct plus here, is that git log output is incremental, so you have
answers about the first diffs quite quick, which let you examine the
first answers while the rest is still being computed.

Unlike git annotate, this also allows you to restrict the revisions
where it searches to a range where you know this happened, which makes
it almost instantaneous in most cases.

Of course, if the line is '    free(p);\n' then you will probably have
quite a few false positives, but with the path restriction, I assume
this will still be quite accurate.
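
As the git documentation describes it, -S reports changes that alter the number of occurrences of the given string. A minimal sketch of that test on two in-memory blobs follows; `count_occurrences` and `pickaxe_hit` are hypothetical names, not git functions, and the blobs are assumed to be NUL-terminated text.

```c
#include <string.h>

/* Count non-overlapping occurrences of needle in a NUL-terminated text. */
size_t count_occurrences(const char *text, const char *needle)
{
	size_t n = 0, len = strlen(needle);

	if (len == 0)
		return 0;
	while ((text = strstr(text, needle)) != NULL) {
		n++;
		text += len;
	}
	return n;
}

/*
 * Nonzero when going from old_blob to new_blob adds or removes an
 * occurrence of needle, i.e. when the occurrence counts differ.
 */
int pickaxe_hit(const char *old_blob, const char *new_blob, const char *needle)
{
	return count_occurrences(old_blob, needle) !=
	       count_occurrences(new_blob, needle);
}
```

Under this rule a change that only moves the line within the file is not reported, because the count stays the same.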

What is important here is to know what the real question is that the GCC
programmers want to answer. It seems to me that `blame` is
overkill for the underlying issue.


Note that this does not justify the current memory consumption, which just
looks bad and wrong to me, but this aims at finding a way to answer your
question doing just what you need to answer it and not gazillions of
other things :)
-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org
From: Junio C Hamano
Date: Tuesday, December 11, 2007 - 12:59 pm

Yes, but blame also takes revision bottoms (obviously you have to start
digging from a single revision so "blame master..next pu" would not

You can feed more than a line from -S, and the assumed and recommended

Right.

-

From: Linus Torvalds
Date: Tuesday, December 11, 2007 - 12:42 pm

Oh, I agree. It's why we do have "git blame" these days, and it's why I've 
tried to make people use the nicer incremental mode, which is not at all 
faster, but it's a hell of a lot more pleasant to use because you get some 
output immediately.

In other words,

	git blame gcc/ChangeLog

is virtually useless because it's too expensive, but try doing

	git gui blame gcc/ChangeLog

instead, and doesn't that just seem nicer? (*)

The difference is that the GUI one does it incrementally, and doesn't have 
to get _all_ the results before it can start reporting blame.

Not that I claim that the gui blame is perfect either (I dunno why it 
delays the nice coloring so long, for example), but it was something I 
pushed - and others made the gui for - exactly to help people with the 

We do that. The expense for git is that we don't do the revisions as a 
single file at all. We'll look through each commit, check whether the 
"gcc" directory changed, if it did, we'll go into it, and check whether 
the "ChangeLog" file changed - and if it did, we'll actually diff it 

Not gcc/ChangeLog, though (apart from the renames that happen 
occasionally).

Btw, an example of something git *should* do right, but is just too damn 
expensive, is doing

	git gui blame gcc/ChangeLog-2000

and have it actually be able to track the original source of each of those 
annotations across that "ChangeLog split from hell". 

I bet it would eventually get it right, but that's a large file, way back 
in history, and it will try to do a non-whitespace blame with copy 
detection.

That's *expensive*, although it is an amusing thing to try to do ;)

			Linus

PS. I also do agree that we seem to use an excessive amount of memory 
there. As to whether it's the same issue or not, I'd not go as far as Nico 
and say "yes" yet. But it's interesting.

It's not entirely surprising that we see multiple issues with the gcc 
repo, simply because it's not the kind of repo that people have ever 
really ...
From: Linus Torvalds
Date: Tuesday, December 11, 2007 - 12:50 pm

And, btw: the diff is totally different from the xdelta we have, so even 
if we have an already prepared nice xdelta between the two versions, we'll 
end up re-generating the files in full, and then do a diff on the end 
result.

Of course, part of that is that git logically *never* works with deltas, 
except in the actual code-paths that generate objects (or generate packs, 
of course). So even if we had used a delta algorithm that would be 
amenable to be turned into a diff directly, it would have been a layering 
violation to actually do that.

Other systems can sometimes just re-use their deltas to generate the 
diffs and/or blame information. I dunno whether SVN does that. CVS does, 
afaik.

			Linus
-

From: Daniel Berlin
Date: Tuesday, December 11, 2007 - 2:14 pm

CVS does because its delta is line based, so it's easy.

You theoretically can generate blame info from SVN/GIT's block deltas,
but you, of course, have the problem GIT does, which is that the delta
is not meant to represent the actual changes that occurred, but
instead, the smallest way to reconstruct data x from data y.
This only sometimes has any relation to how the file actually changed.
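
To illustrate why a block delta says little about line-level changes, here is a sketch that replays a copy/insert delta. The opcode layout (`struct delta_op`, `DELTA_COPY`, `DELTA_INSERT`) is a hypothetical simplification and not the actual encoding used in git packs or by SVN.

```c
#include <string.h>

/* Hypothetical, simplified delta opcode; not git's actual pack encoding. */
enum delta_kind { DELTA_COPY, DELTA_INSERT };

struct delta_op {
	enum delta_kind kind;
	size_t offset;		/* DELTA_COPY: start offset in the base */
	size_t len;		/* number of bytes to copy or insert */
	const char *data;	/* DELTA_INSERT: literal bytes */
};

/*
 * Rebuild the target from base by replaying ops into out.
 * Returns the number of bytes written, or 0 if out is too small
 * or a copy runs past the end of base.
 */
size_t apply_delta(const char *base, size_t base_len,
		   const struct delta_op *ops, size_t nops,
		   char *out, size_t out_cap)
{
	size_t pos = 0, i;

	for (i = 0; i < nops; i++) {
		const struct delta_op *op = &ops[i];

		if (op->len > out_cap - pos)
			return 0;
		if (op->kind == DELTA_COPY) {
			if (op->offset > base_len ||
			    op->len > base_len - op->offset)
				return 0;
			memcpy(out + pos, base + op->offset, op->len);
		} else {
			memcpy(out + pos, op->data, op->len);
		}
		pos += op->len;
	}
	return pos;
}
```

Turning "int a;\nint b;\n" into "int a;\nint c;\n" can be encoded as: copy 11 bytes, insert "c", copy 2 bytes. The copy boundaries fall in the middle of a line, so the ops reconstruct the data without naming which line changed.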
-

From: Linus Torvalds
Date: Tuesday, December 11, 2007 - 2:34 pm

Exactly. Git objects in themselves have no history or relationships, and 
being a delta against another object means nothing at all except for the 
fact that the data seems to resemble that other object (which has a 
_correlation_ with being related, but nothing more).

Anyway, I think the git annotate memory usage was simply just a real bug 
that nobody had noticed before because the memory leak wasn't all that 
noticeable with smaller files and/or less deep histories. Can you verify 
that it works for you with the patch I sent out?

With that fix, I could even run 

	git blame -C gcc/ChangeLog-2000

to see the blame machinery work past the strange "combine many different 
changelogs into year-based ones" commit. Now, I cannot honestly claim that 
it was really *usable* (it did take three minutes to run!), but sometimes 
those three minutes of CPU time may be worth it, if it shows the real 
historical context it came from. 

In the case of the ChangeLog-2000 file, all the original lines obviously 
came from older versions of a file called "gcc/ChangeLog", so the end 
result doesn't really show what an involved situation it was to track the 
sources back through not just renames, but actually file splits and 
merges. Sad, but once you know what it did it's still a bit cool to see 
that it worked ;)

			Linus
-

From: Jeff King
Date: Wednesday, December 12, 2007 - 12:57 am

That doesn't mean we can't opportunistically jump layers when available,
and fall back on the regular behavior otherwise. The nice thing about
clean and simple layers is that you can always add optimizations later
by poking sane holes.

Let's assume for the sake of argument that we can convert an xdelta into
a diff fairly cheaply.  Using the patch below, we can count the places
where we are diffing two blobs, and one blob is a delta base of the
other (assuming our magical conversion function can also reverse diffs.
;) ).

For a "git log -p" on git.git, I get:

   9951 diffs could be optimized
  10958 diffs could not be optimized

or about 48%. It would be nice if we could drop the cost by almost 50%
(if our magical function is free to call, too!).

Of course, I haven't even looked at whether converting xdeltas to
unified diffs is possible. I suspect in some cases it is (e.g., pure
addition of text) and in some cases it isn't (I assume xdelta doesn't
have any context lines, which might hurt). And it's possible that a
specialized diff user like git-blame can just learn to use the xdeltas
by itself (I didn't get a "could optimize" count for git-blame since
it seems to follow a different codepath for its diffs).

---
diff --git a/cache.h b/cache.h
index 27d90fe..0d672be 100644
--- a/cache.h
+++ b/cache.h
@@ -569,6 +569,7 @@ extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsign
 extern unsigned long unpack_object_header_gently(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 extern const char *packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
+extern int have_xdelta(unsigned char from[20], unsigned char to[20]);
 extern int matches_pack_name(struct packed_git *p, const char *name);
 
 /* Dumb servers support */
diff --git a/diff.c ...
From: Linus Torvalds
Date: Tuesday, December 11, 2007 - 2:14 pm

I think the answer here is that git-annotate is a totally different issue.

The blame machinery keeps around all the blobs it has ever needed to do a 
diff, which explains why something like gcc/ChangeLog blows up badly.

Try this trivial patch.

It will cause us to potentially re-generate some blobs much more, but 
that's a reasonably cheap operation, and our delta base cache will get the 
expensive cases.

It's still not a free operation, but I get

	[torvalds@woody gcc]$ /usr/bin/time ~/git/git-blame gcc/ChangeLog > /dev/null
	20.68user 1.25system 0:21.94elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
	0inputs+0outputs (0major+599833minor)pagefaults 0swaps

so it took 22s and I never saw it grow very large either (it grew to 72M 
resident, but I don't know how much of that was the mmap of the 
pack-file, so that number is pretty meaningless). Valgrind reports that 
it used a maximum heap of about 24M, and almost all of that seems to have 
been in the delta cache (which is all good).

		Linus

----
 builtin-blame.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/builtin-blame.c b/builtin-blame.c
index c158d31..18f9924 100644
--- a/builtin-blame.c
+++ b/builtin-blame.c
@@ -87,6 +87,14 @@ struct origin {
 	char path[FLEX_ARRAY];
 };
 
+static void drop_origin_blob(struct origin *o)
+{
+	if (o->file.ptr) {
+		free(o->file.ptr);
+		o->file.ptr = NULL;
+	}
+}
+
 /*
  * Given an origin, prepare mmfile_t structure to be used by the
  * diff machinery
@@ -558,6 +566,8 @@ static struct patch *get_patch(struct origin *parent, struct origin *origin)
 	if (!file_p.ptr || !file_o.ptr)
 		return NULL;
 	patch = compare_buffer(&file_p, &file_o, 0);
+	drop_origin_blob(parent);
+	drop_origin_blob(origin);
 	num_get_patch++;
 	return patch;
 }
-

From: Junio C Hamano
Date: Tuesday, December 11, 2007 - 2:54 pm

While this should be safe (because the user of blob lazily re-fetches),
it feels a bit too aggressive, especially when -C or other "retry and
try harder to assign blame elsewhere" option is used.
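
The "lazily re-fetches" property can be sketched in isolation. This is a toy model, not the code in builtin-blame.c: `struct origin_sketch`, `fill_blob` and `drop_blob` are hypothetical names, and a string constant stands in for the object database.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an origin: a stored blob plus a cached copy. */
struct origin_sketch {
	const char *stored;	/* stands in for the object database */
	char *ptr;		/* cached blob contents, or NULL when dropped */
	int loads;		/* how many times the blob was regenerated */
};

/* Return the blob contents, regenerating the cache only if it was dropped. */
const char *fill_blob(struct origin_sketch *o)
{
	if (!o->ptr) {
		size_t len = strlen(o->stored) + 1;

		o->ptr = malloc(len);
		if (!o->ptr)
			return NULL;
		memcpy(o->ptr, o->stored, len);
		o->loads++;
	}
	return o->ptr;
}

/* Free the cached copy; a later fill_blob() will bring it back. */
void drop_blob(struct origin_sketch *o)
{
	free(o->ptr);
	o->ptr = NULL;
}
```

Dropping is always safe in this model; the only cost of dropping too eagerly is the extra regeneration counted in `loads`, which is the trade-off being discussed.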

Instead, how about discarding after we are done with each origin, like
this?

---
 builtin-blame.c |   17 +++++++++++++++--
 1 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/builtin-blame.c b/builtin-blame.c
index c158d31..eda79d0 100644
--- a/builtin-blame.c
+++ b/builtin-blame.c
@@ -130,6 +130,14 @@ static void origin_decref(struct origin *o)
 	}
 }
 
+static void drop_origin_blob(struct origin *o)
+{
+	if (o->file.ptr) {
+		free(o->file.ptr);
+		o->file.ptr = NULL;
+	}
+}
+
 /*
  * Each group of lines is described by a blame_entry; it can be split
  * as we pass blame to the parents.  They form a linked list in the
@@ -1274,8 +1282,13 @@ static void pass_blame(struct scoreboard *sb, struct origin *origin, int opt)
 		}
 
  finish:
-	for (i = 0; i < MAXPARENT; i++)
-		origin_decref(parent_origin[i]);
+	for (i = 0; i < MAXPARENT; i++) {
+		if (parent_origin[i]) {
+			drop_origin_blob(parent_origin[i]);
+			origin_decref(parent_origin[i]);
+		}
+	}
+	drop_origin_blob(origin);
 }
 
 /*




-

From: Linus Torvalds
Date: Tuesday, December 11, 2007 - 4:36 pm

Sure, looks fine to me. With either of these patches, all of the cost is 
in the diffing routines:

	samples  %        image name               app name                 symbol name
	191317   31.4074  git                      git                      xdl_hash_record
	120060   19.7096  git                      git                      xdl_recmatch
	99286    16.2992  git                      git                      xdl_prepare_ctx
	56370     9.2539  libc-2.7.so              libc-2.7.so              memcpy
	23315     3.8275  git                      git                      xdl_prepare_env
	..

and while I suspect xdiff could be optimized a bit more for the cases 
where we have no changes at the end, that's beyond my skills.

		Linus

-

From: Linus Torvalds
Date: Tuesday, December 11, 2007 - 5:02 pm

Ok, I lied.

Nothing is beyond my skills. My mad k0der skillz are unbeatable.

This speeds up git-blame on ChangeLog-style files by a big amount, by just 
ignoring the common end that we don't care about, since we don't want any 
context anyway at that point. So I now get:

	[torvalds@woody gcc]$ time git blame gcc/ChangeLog > /dev/null

	real    0m7.031s
	user    0m6.852s
	sys     0m0.180s

which seems quite reasonable, and is about three times faster than trying 
to diff those big files.

Davide: this really _does_ make a huge difference. Maybe xdiff itself 
should do this optimization on its own, rather than have the caller hack 
around the fact that xdiff doesn't handle this common case all that well?

The same thing obviously works for the beginning-of-file too, but then you 
have to play games with line numbers being affected etc, so the end is the 
rather much easier case and is the case that a ChangeLog-style file cares 
about.

Daniel, this is obviously on top of the patches that fix the memory leak.

			Linus

---
diff --git a/builtin-blame.c b/builtin-blame.c
index c158d31..677188c 100644
--- a/builtin-blame.c
+++ b/builtin-blame.c
@@ -543,6 +551,20 @@ static struct patch *compare_buffer(mmfile_t *file_p, mmfile_t *file_o,
 	return state.ret;
 }
 
+#define BLOCK 1024
+
+static void truncate_common_data(mmfile_t *a, mmfile_t *b)
+{
+	long l1 = a->size, l2 = b->size;
+
+	while ((l1 -= BLOCK) > 0 && (l2 -= BLOCK) > 0) {
+		if (memcmp(a->ptr + l1, b->ptr + l2, BLOCK))
+			break;
+		a->size = l1;
+		b->size = l2;
+	}
+}
+
 /*
  * Run diff between two origins and grab the patch output, so that
  * we can pass blame for lines origin is currently suspected for
@@ -557,6 +579,7 @@ static struct patch *get_patch(struct origin *parent, struct origin *origin)
 	fill_origin_blob(origin, &file_o);
 	if (!file_p.ptr || !file_o.ptr)
 		return NULL;
+	truncate_common_data(&file_p, &file_o);
 	patch = compare_buffer(&file_p, &file_o, 0);
 ...
From: Davide Libenzi
Date: Tuesday, December 11, 2007 - 5:22 pm

I didn't follow the thread, but I can guess from the subject that this is 
about memory, isn't it?
Libxdiff already has a xdl_trim_ends() that strips all the common 
beginning and ending records, but at that point files are already loaded.
Libxdiff works with memory files in order to keep any sort of 
system dependency out of the window, so the optimization would be 
useless on the libxdiff side, because the user would already have to 
have the file loaded in memory in order to pass it to libxdiff.
If this is really about memory, this better be kept on the libxdiff caller 
side, so that it can avoid loading the terminal file sections altogether.
About your code, you may want to have an extend-till-next-eol code after 
the trimming part, since the last line may be used for context in the 
diffs.




- Davide


-

From: Linus Torvalds
Date: Tuesday, December 11, 2007 - 5:50 pm

That's not the problem. The problem with xdl_trim_ends() is that it 
happens *after* you have done all the hashing, so as an optimization it's 
fairly useless, because it still leaves the real cost (the per-line 
hashing) on the table.

So doing the trimming of the ends before you do even that, allows you to 
just do the trivial "let's see if the ends are identical" with a plain 
memcmp, which is much faster.
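
A standalone version of the block-wise tail trim from the patch earlier in the thread, taking raw buffers and sizes instead of mmfile_t so that it can be compiled on its own. The names `trim_common_tail` and `TRIM_BLOCK` are chosen for this sketch only.

```c
#include <string.h>

#define TRIM_BLOCK 1024

/*
 * Shrink *size_a and *size_b while the last TRIM_BLOCK bytes of both
 * buffers are identical. The loop has the same shape as the one in the
 * patch, so at least one byte of each buffer is always kept.
 */
void trim_common_tail(const char *a, long *size_a, const char *b, long *size_b)
{
	long l1 = *size_a, l2 = *size_b;

	while ((l1 -= TRIM_BLOCK) > 0 && (l2 -= TRIM_BLOCK) > 0) {
		if (memcmp(a + l1, b + l2, TRIM_BLOCK))
			break;
		*size_a = l1;
		*size_b = l2;
	}
}
```

Because the comparison works on fixed-size blocks rather than lines, the trimmed buffers may end in the middle of a line.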

			Linus
-

From: Davide Libenzi
Date: Tuesday, December 11, 2007 - 6:12 pm

Careful. The real cost of diffing is not the O(N) pass of the prepare 
phase. It's the potentially O(N*M) worst case of the cross-record compare. 
So that optimization is far from useless. That optimization is indeed 

Yes, tail trimming done on a block-basis is faster and does not consume 
memory. The code for libxdiff would have to be a bit more complex though, 
since memory files can be composed of many sections, of different sizes 
(so you cannot just assume it's a single block you're trimming the end). 
Also, you'd need some code at the end that hands you back at least the N 
lines you want for context.



- Davide


-

From: Linus Torvalds
Date: Tuesday, December 11, 2007 - 7:10 pm

I'm not saying it's useless. I'm saying it's ineffective.

My simple patch that you saw, speeded up a real-life case by A FACTOR OF 

Sure. The special case I added it to specifically wanted a context of zero 
in the caller, so I could just ignore that.

But doing this in general and handing back the context is a simple matter 
of

	while (size < orig && context_lines) {
		if (src->buffer[size++] == '\n')
			context_lines--;
	}

which will usually hit in a really short time (ie three lines by default, 
just a few tens of bytes).
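
The loop above can be wrapped into a self-contained function for experimentation. `restore_context`, `buf`, `trimmed` and `orig` are hypothetical names for this sketch; the loop body is the one quoted in the message.

```c
/*
 * Grow a trimmed size back towards orig until context_lines newlines
 * have been re-included. buf is the full buffer, trimmed is the size
 * after tail trimming, and orig is the original size.
 */
long restore_context(const char *buf, long trimmed, long orig, int context_lines)
{
	long size = trimmed;

	while (size < orig && context_lines) {
		if (buf[size++] == '\n')
			context_lines--;
	}
	return size;
}
```

If the trim point falls in the middle of a line, the first newline found only completes that partial line, so a caller that needs N full context lines might ask for one more.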
			
		Linus
-

From: Linus Torvalds
Date: Tuesday, December 11, 2007 - 8:35 pm

Sorry, I _did_ call it "fairly useless". 

The rest of the comment stands. I'm sure the trimming that xdiff does is 
good at avoiding some common O(n*m) cases, it's just not as good as it 
could be, and leaves a big constant factor of the O(n) case on the table.

			Linus
-

From: Junio C Hamano
Date: Tuesday, December 11, 2007 - 5:56 pm

Funny.  I did not understand what you were talking about "no changes at
the end" when I read it ('cause I am at work and do not have the data
you are looking at handy), but now I see what you meant.  It is a cute
hack that optimizes for a very special case of "prepend only" files (aka
"ChangeLog").

I suspect that this optimization has an interesting corner case, though.
What happens if you chomp at the middle of the last line that is
different between the two files?  xdiff will report the line number but
wouldn't its (now artificial) "No newline at the end of the file" affect
the blame logic?

Besides, "prepend only" (or "append only") files would be good
candidates for the original -S"pickaxe" search, I would imagine, and
unless you are looking at that ChangeLog-2000 consolidated log, isn't
blame way overkill?


-

From: Linus Torvalds
Date: Tuesday, December 11, 2007 - 7:20 pm

It shouldn't. I thought about it, but there doesn't seem to be any reason 
why blame could possibly care - the message can come at the end of a 
_real_ file, of course, so if the extra message confuses the blame logic, 
there's already a bug there. 


Actually, I suspect that this makes a difference for totally normal files 
too. I bet it cuts the size of the files to be tested for the common case 
(ie just a few small changes) down by 30-50% even on average. The fact 
that it cuts it down by 99.9% on ChangeLog files is just an added bonus.

As Davide mentioned, xdiff actually does something like that hack for the 
beginning and end of files internally _anyway_, the problem with that is 
that it does it so late that it's already done a fairly expensive hash for 
the file (and allocated space for it based on guesses that are in turn 
based on the original size) that it doesn't actually get the full effect 
of the optimization.

			Linus
-

From: Linus Torvalds
Date: Tuesday, December 11, 2007 - 7:39 pm

This *should* trigger the special case:

	mkdir test-dir
	cd test-dir
	git init
	(echo -n a ; yes '' | dd count=2) > file
	git add file
	git commit -m "'a' + 1k newlines"
	(echo -n b ; yes '' | dd count=2) > file
	git add file
	git commit -m "'b' + 1k newlines"

and it all seems to work fine.

But I didn't actually check that it really triggered, this is just 
creating a 1025-byte file that has a single character and then 1024 
newlines. So when the logic removes the shared tail (all the newlines), it 
leaves a single-character newlineless buffer for diff, and no, git-blame 
didn't care, and got the right answer.

			Linus
-

From: Daniel Berlin
Date: Wednesday, December 12, 2007 - 12:43 pm

Thanks, these patches work *great*.

I'm starting to have a few users who have no experience with git or hg
try their daily workflow with it, to see what UI issues they come up
with :)
-

From: Junio C Hamano
Date: Tuesday, December 11, 2007 - 9:48 pm

It's been a while since I last looked at the blame engine, and it hit me that
it would be interesting to run the assign_blame() loop on a multi-core machine
in parallel threads.


-

From: Daniel Berlin
Date: Tuesday, December 11, 2007 - 2:24 pm

I'm not surprised at all.
We had a number of issues with SVN that needed to be resolved.
I'm basically trying to get issues worked (both on git and mercurial)
out to the point where it is fair for our users to try their branch
and trunk workflows with git and mercurial,  and see which they like
-

From: Shawn O. Pearce
Date: Tuesday, December 11, 2007 - 8:57 pm

Linus Torvalds <torvalds@linux-foundation.org> wrote:

git-gui waits to color until after it gets the move/copy annotations
back from the -C -C -w second pass it does.  This way the coloring
is based on the original source location, not on the move/copy that
caused it to be placed where it is now.

I played around with this for a while and finally made it work the
way it does as I assumed most users would want to see where something
originally came from more than how it got moved to where it is now.

IOW the (very expensive) -C -C -w pass is usually much more
interesting than the default (fast) pass, so that is the line
annotation data we color with.  But it takes longer to get and
is run second, so yea, coloring takes a while.

-- 
Shawn.
-

From: Marco Costalba
Date: Tuesday, December 11, 2007 - 1:29 pm

This could be useful for a command line tool but for a GUI the top
down approach is a myth IMHO.

In the GUI case what you actually end up doing (because a GUI allows
it) is to start from the latest file version, check the code region
you are interested in, then when you find the changed lines you _may_ want
to double click and go to see how the file was before that change
and then perhaps start a new dig.

I found this is my typical workflow with annotation info because I'm
more interested not in what lines have changed but in _why_ they have changed,
and to do this you naturally end up digging in the past (and also checking
the corresponding revision's patch, for example in another tab).

In this case the advantage of oldest to newest annotation algorithm is
that you have _already_ annotated all the history so you can walk and
dig back and forth among the different file versions without *any*
additional delay.

Marco
-

From: Steven Grimm
Date: Tuesday, December 11, 2007 - 12:29 pm

My use of "git blame" is perhaps not typical, but I use it fairly  
often when I'm looking at a part of my company's code base that I'm  
not terribly familiar with. I've found it's the fastest way to figure  
out who to go ask about a particular block of code that I think is  
responsible for a bug, or more commonly, who to ask to review a change  
I'm making.

"git log" is too coarse-grained to be useful for that purpose; it  
usually doesn't tell me which of the 500 revisions to the file I'm  
looking at introduced the actual line of code I want to change.

To me that really has nothing whatsoever to do with git workflow or  
svn workflow; it happens well before I'm ready to do any kind of  
integration or commit or even, sometimes, before I've made any changes  
to any code at all.

Given infinite spare time, one of the things I'd be strongly tempted  
to try to build would be some kind of blame cache. You could  
theoretically make blame pretty much instantaneous by doing something  
as simple as caching the per-line revision ID for each file in each  
revision in a shadow repository (or a shadow branch in the main repo)  
and keeping a map between shadow-repo revisions and real-repo ones. If  
the cache was of the form "one SHA1 hash per line in the original  
file" it would delta-compress pretty well. It'd be easy to update  
incrementally since you only need to walk back in history until you  
get to the most recently cached revision for each file, at which point  
you use the cached value for all the lines that haven't changed.
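
A minimal sketch of the incremental update step, assuming the cache stores one revision id per line and that a diff against the cached revision has already produced a line mapping. Plain `int` ids are used instead of SHA1 hashes for brevity, and `update_blame_cache` and `map` are hypothetical names.

```c
#include <stddef.h>

/*
 * Hypothetical cache update: derive the per-line blame of a new revision
 * from the cached blame of an older one.
 *
 * old_blame[j]  cached revision id that introduced line j of the old file
 * map[i]        index of new line i in the old file, or -1 if it changed
 * new_rev       id of the revision being added to the cache
 */
void update_blame_cache(const int *old_blame, const int *map,
			size_t new_nlines, int new_rev, int *new_blame)
{
	size_t i;

	for (i = 0; i < new_nlines; i++)
		new_blame[i] = (map[i] >= 0) ? old_blame[map[i]] : new_rev;
}
```

This covers only the case where the new revision directly follows the cached one. With several uncached revisions in between, an ordinary blame walk would run first and fall back to the cached values once it reaches the cached revision.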

Yeah, I know, code talks louder than words...

-Steve
-

From: Jakub Narebski
Date: Tuesday, December 11, 2007 - 1:14 pm

There is always "pickaxe" search, i.e. 
  $ git log -p -S'<string>' -- <file or pathspec>
which can be used instead of blame (perhaps with --follow).

And you can limit blame to the interesting region of file, and to
interesting (important) range of revisions.


[about blame cache]

"git gui blame" uses incremental blame; if only it accepted range
(file fragment) limiting, and if "reblame" (blame --reference=<rev>,
blaming incrementally only lines which changed wrt. given revision)
was implemented.

BTW, qgit actually does blame using its own "multiple files bottom-up
blame" code (it would be nice to have it in core-git if possible,
hint, hint), and does some caching, although I'm not sure whether it
caches blame info as well. You should try it, I think.

-- 
Jakub Narebski
Poland
ShadeHawk on #git
-

From: Nicolas Pitre
Date: Tuesday, December 11, 2007 - 12:06 pm

It has no excuse for eating up to 1.6GB of RAM though.  That's plainly 
wrong.


Nicolas
-

From: Jon Smirl
Date: Tuesday, December 11, 2007 - 1:31 pm

git blame gcc/ChangeLog
It needs 2.25GB of RAM to run without swapping

That is pretty close to the same number the repack needs.

-- 
Jon Smirl
jonsmirl@gmail.com
-

From: Matthieu Moy
Date: Tuesday, December 11, 2007 - 12:01 pm

I've seen you pointing out this kind of example many times, but is that
really different from what even SVN does? "svn log drivers/char" will
also list atomic commits, and give me a filtered view of the global
log.

So, yes, that's cool, but I don't see a real difference between git
and almost anything else (except CVS which really got this wrong, no
big surprise).

-- 
Matthieu
-

From: Linus Torvalds
Date: Tuesday, December 11, 2007 - 12:22 pm

Ok, BK and CVS both got this horribly wrong, which is why I care. Maybe 
this is one of the things SVN gets right.

I seriously doubt it, though. Do you get *history* right, or do you just 
get a random list of commits?

Of course, to see the difference, you need to do "gitk drivers/char" or 
use another of the log viewers that actually show you history too. A plain 
"git log" won't make it obvious (unless you actually ask for parent 
information and then just track the history in your head, in which case 
you don't really need an SCM in the first place ;)

			Linus
-

From: Daniel Berlin
Date: Tuesday, December 11, 2007 - 12:24 pm

No, it will get actual history (IE not just things that happen to have
that path in the repository)
-

From: Pierre Habouzit
Date: Tuesday, December 11, 2007 - 12:42 pm

OTOH svn has the result right, but the way it does that is horrible.
When you svn log some/path, I think it just (basically) asks svn log for
each file in that directory, and merge the logs together. This is "easy"
for svn since it remembers "where this specific file" came from.

So for svn it's just a matter of merging the individual files' histories
together. It may have a more clever implementation, but basically I
believe it would be similar to that in the end.

Of course, if you do something as stupid as:
  svn cp Makefile some/path/foo.c
  # completely rewrite foo.c
  svn commit
then you'll have the history of `Makefile` melded into the
some/path/foo.c svn log, which is completely horribly wrong.

or if you do (which unlike the previous example isn't silly for so
many good reasons):
  cp bar.c foo.c
  svn add foo.c
  svn commit
then foo.c won't have bar.c history in its svn log.

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org
-

From: Daniel Berlin
Date: Tuesday, December 11, 2007 - 2:09 pm

What?
We version directories too.
We don't do svn log for each file in the directory when you request a path.
We look at the history of the path, follow renames, etc.

When you change foo/bar/fred.c, we consider it a change to foo/bar and
foo/, and thus, they have new versions.

I'm not sure where you get this crazy notion that we do anything with
files when you ask about directories.
-

From: Matthieu Moy
Date: Tuesday, December 11, 2007 - 4:37 pm

Well, you don't get merge commit right with SVN, but that's a
different issue (svn 1.5 is supposed to have something about merge
history, I don't know how it's done ...). So, if by "history", you
mean how branches interfered with each other, obviously, SVN is bad at this.
But it's equally bad at "svn log dir/" and plain "svn log".

But to simplify, if you take a linear history (no merge commits),
"svn log dir/" gives you the list of commits which changed something
inside "dir/". As pointed out in other messages, the way it's done is
really different from what git does. SVN does know a lot about
directories, and records a lot about them at commit time, while git
just considers them as file containers.

Yeah, CVS got this terribly wrong. IIRC, it just took the logs for
individual files and mixed them together, so a commit touching
multiple files would appear several times.

I've taken SVN as an extreme example, but at least bzr and mercurial
have an approach very similar to git.

So, to me, this particular point is something git obviously got right,
but not a point where git is so different from the others.

-- 
Matthieu
-

From: Linus Torvalds
Date: Tuesday, December 11, 2007 - 4:48 pm

Yeah, git just has higher goals.

The time history really matters (or rather, what I call the "shape" of 
history) is when you are trying to merge, and you get a merge conflict. 
That's when you want to do

	gitk master merge ^merge-base -- files-that-are-unmerged

and in fact this is such an important thing for me that there is a 
shorthand argument to do exactly that, ie:

	gitk --merge

which shows the commits that touched the unmerged files graphically *with* 
the history being correct (ie you don't just get a random log of "these 
changes happened", you get the real history of the two branches as it 

Sure, linear history is trivial. But it's also almost totally 
uninteresting.
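
The conflict case can be reproduced in a scratch repository. This is only a sketch: gitk is graphical, so it uses `git log --merge` as the command-line counterpart, and the branch names, file names and commit messages are all invented for illustration.

```shell
#!/bin/sh
# Scratch repository with two branches that both edit conflict.txt.
# All names below are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.invalid"

echo "base" > conflict.txt
echo "base" > other.txt
git add conflict.txt other.txt
git commit -q -m "base"

git checkout -q -b topic
echo "topic side" > conflict.txt
git commit -q -am "topic: edit conflict.txt"

git checkout -q main
echo "main side" > conflict.txt
git commit -q -am "main: edit conflict.txt"
echo "unrelated" > other.txt
git commit -q -am "main: edit other.txt"

# The merge stops with a conflict in conflict.txt; that is expected here.
git merge topic >/dev/null 2>&1 || true

# Show only commits from either side that touched the unmerged file.
merge_log=$(git log --merge --format=%s)
echo "$merge_log"
```

The commit that only touched other.txt is left out, because that file merged cleanly.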

			Linus
-

From: Nicolas Pitre
Date: Tuesday, December 11, 2007 - 10:47 am

And I bet this is the exact same issue as the repack one.

Do you still have the 2.1GB pack around?  I bet annotate would eat much 
less memory in that case.


Nicolas
-

From: Daniel Berlin
Date: Tuesday, December 11, 2007 - 10:53 am

I do not, but i could remake it in a few days if it would help
-

From: Nicolas Pitre
Date: Tuesday, December 11, 2007 - 11:01 am

Well, depending on the amount of RAM in your machine, you might even not 
be able to remake it at the moment.  I currently can't reproduce it 
myself due to the same out-of-memory issue.


Nicolas
-

From: Marco Costalba
Date: Tuesday, December 11, 2007 - 11:32 am

Speed of annotation is mainly due to getting the file history, more
than to calculating the actual annotation.

I don't know *how* file history is stored in the other SCMs; perhaps
it is easier to retrieve, i.e. without a full walk across the
revisions...

In case you have qgit (especially the 2.0 version that is much faster
in this feature) I would be very interested to have annotation times
on this file. Indeed, annotation times are shown split between file
history retrieval, based on something along the lines of "git log -p
-- <path>", and actual annotation calculation (fully internal at
qgit).

I would be interested in cold start and warm cache start (close the
annotation tab and start annotation again).


Thanks (a lot)
Marco
-

From: Daniel Berlin
Date: Tuesday, December 11, 2007 - 12:03 pm

It is stored in an easier format. However, can you not simply provide
side-indexes to do the annotation?

I guess that won't work in git because you can change history (in
other scm's, history is readonly so you could know the results for

I will try to do this.
-

From: Jason Sewall
Date: Tuesday, December 11, 2007 - 12:27 pm

I don't know how other scms work, but history is definitely readonly
in git - whatever sha1 you have that describes a commit was calculated
based on its ancestor commits.

If you have a commit's id, it will *always* refer to the same thing -
a tree state and its complete ancestry.
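
A small experiment makes this concrete. The sketch below builds commits by hand with `git commit-tree`, pinning the identity and timestamp through environment variables so that the only thing that varies is the parent; the names and messages are hypothetical.

```shell
#!/bin/sh
# Build commits directly so that every input to the commit id is controlled.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Fixed identity and timestamp (illustrative values).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.invalid"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.invalid"
export GIT_AUTHOR_DATE="1197374400 +0000" GIT_COMMITTER_DATE="1197374400 +0000"

# The index is empty, so this writes the empty tree.
tree=$(git write-tree)

root_a=$(echo "root A" | git commit-tree "$tree")
root_b=$(echo "root B" | git commit-tree "$tree")

# Same tree, same message, same dates; only the parent differs.
child_a=$(echo "child" | git commit-tree "$tree" -p "$root_a")
child_b=$(echo "child" | git commit-tree "$tree" -p "$root_b")

# Repeating the first construction yields the identical id.
child_a_again=$(echo "child" | git commit-tree "$tree" -p "$root_a")

echo "child of A:         $child_a"
echo "child of B:         $child_b"
echo "child of A (again): $child_a_again"
```

Identical inputs give the same id, while a different ancestor gives a different id, so an id pins down the whole ancestry.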
-

From: Marco Costalba
Date: Tuesday, December 11, 2007 - 12:14 pm

As Linus pointed out, annotation in git is "much slower and much more
costly than just having a local history view to begin with".

Indeed, to annotate, say, kernel/sched.c

the time spent by git while executing

git log -p -- kernel/sched.c

could also be 10X higher than the following annotation processing time
starting from the git log output.

Unfortunately my knowledge of git internals falls far, far short of
guessing what could be done to improve the *one file* history case

Thanks. Much appreciated.
-

From: Daniel Barkalow
Date: Tuesday, December 11, 2007 - 12:46 pm

History in git is read-only. It's just that git lets you fork and move 
forward with something different. Each commit can never change (and, in 
fact, you'd have to badly break SHA1 to change it), but which commits are 
relevant to the history can change.

Keeping extra information is fine; at worst, it'll go irrelevant.

	-Daniel
*This .sig left intentionally blank*
-

From: Marco Costalba
Date: Tuesday, December 11, 2007 - 1:14 pm

Well, revisions never change, but history, intended as a revision's
parent information, can and does change when you use a path delimiter.
So does the graph, which is a direct visualization of parent
information.

For a single revision (that modifies, say, 3 files) you can have at least
3 different histories, and actually more if you also want to visualize
the history of the directory trees that own the modified files.

You end up with a quite big number of different histories all showing
your revisions in different ways, according to the path delimiter you
use.

Perhaps this is the intended meaning of "changing histories", and in
any case this is the reason you cannot (or it makes no sense to) "save"
a single file history in git.
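
The effect is easy to see with `git rev-list --parents`, which rewrites parents when a path is given. A minimal sketch, with made-up file names and commit messages:

```shell
#!/bin/sh
# Three linear commits; only the first and third touch a.txt.
# File names and messages are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"

echo one > a.txt
git add a.txt
git commit -q -m "c1: touch a.txt"
c1=$(git rev-parse HEAD)

echo one > b.txt
git add b.txt
git commit -q -m "c2: touch b.txt"
c2=$(git rev-parse HEAD)

echo two > a.txt
git commit -q -am "c3: touch a.txt again"
c3=$(git rev-parse HEAD)

# Each output line is "<commit> <parent>"; the first line describes c3.
full_parent=$(git rev-list --parents HEAD | head -n 1 | cut -d' ' -f2)
limited_parent=$(git rev-list --parents HEAD -- a.txt | head -n 1 | cut -d' ' -f2)

echo "parent of c3 in the full history:      $full_parent"
echo "parent of c3 when limited to a.txt:    $limited_parent"
```

The commit objects are identical in both views; only the parent links reported for the path-limited history differ.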

Marco
-

From: Florian Weimer
Date: Wednesday, December 12, 2007 - 3:36 am

A less unwieldy repository that shows the same problem is:

  svn://svn.debian.org/secure-testing/

It's annotating the data/CVE/list file that uses tons of memory.  I
guess you don't need to clone the full history to exhibit the problem.

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
-
