Re: [PATCH] fix diff-delta bad memory access

Previous thread: Re: Implementing branch attributes in git config by Jakub Narebski on Wednesday, May 10, 2006 - 12:03 pm. (1 message)

Next thread: [Patch] git-cvsimport: tiny fix by Elrond on Wednesday, May 10, 2006 - 1:37 pm. (4 messages)
To: Junio C Hamano <junkio@...>
Cc: Alex Riesen <raa.lkml@...>, Randal L. Schwartz <merlyn@...>, <git@...>
Date: Wednesday, May 10, 2006 - 12:26 pm

It cannot be assumed that the given buffer will never be moved when
shrinking the allocated memory size with realloc(). So let's ignore
that optimization for now.

This patch makes Electric Fence happy on Linux.

Signed-off-by: Nicolas Pitre <nico@cam.org>

---

I can't tell if that fixes it on BSD and Cygwin though.

diff --git a/Makefile b/Makefile
diff --git a/diff-delta.c b/diff-delta.c
index c618875..25a798d 100644
--- a/diff-delta.c
+++ b/diff-delta.c
@@ -199,7 +199,6 @@ struct delta_index * create_delta_index(
entry->next = hash[i];
hash[i] = entry++;
hash_count[i]++;
- entries--;
}
}

@@ -230,10 +229,6 @@ struct delta_index * create_delta_index(
}
free(hash_count);

- /* If we didn't use all hash entries, free the unused memory. */
- if (entries)
- index = realloc(index, memsize - entries * sizeof(*entry));
-
return index;
}

-

To: Nicolas Pitre <nico@...>
Cc: Junio C Hamano <junkio@...>, Alex Riesen <raa.lkml@...>, Randal L. Schwartz <merlyn@...>, <git@...>
Date: Wednesday, May 10, 2006 - 1:00 pm

Yeah, that was totally bogus. It would re-allocate _part_ of an
allocation, but that allocation contained not just the index, but all the
hashes and the hash entries too!

This has nothing to do with moving a buffer - it has everything to do with
the fact that you shrunk a buffer that still contained all the other
buffers: you shrunk the "mem" allocation to fit just the first part of it,
and entirely ignored the "hash" and "entry" parts of it.

Btw, I think that whole "allocate everything in one allocation" thing is
potentially bogus even the way it is now, if the alignment constraints of
"index", "hash" and "entry" are different.

When you do

..
index = mem;
mem = index + 1;
hash = mem;
mem = hash + hsize;
entry = mem;
..

it's perfectly fine for "index", but "hash" and "entry" end up having
alignments that depend on the size/alignment of "index" (and for "entry"
on "hash").

So if their alignment requirements are different, you're basically
screwed.

It may work in practice (maybe they all align on pointer boundaries), but
it's damn scary. You should re-consider, or at least make that code be a
lot safer (like actually taking alignment into consideration, both for
total size and for the offset calculations).

That could be done by using unions or something.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Alex Riesen <raa.lkml@...>, Randal L. Schwartz <merlyn@...>, <git@...>
Date: Wednesday, May 10, 2006 - 1:27 pm

No.

The initial allocation assumes a perfectly even distribution of entries
in which case the entry array would be all populated.

When there are repeated bytes then consecutive blocks will have the same
hash, and in that case keeping only the first one is useful.

Which means that the entry pointer won't get to the end of the allocated
space for all entries and I naively assumed that using realloc would
only mark the allocated memory as smaller than it originally was without
actually copying any of the remaining data, which is what happened in
most cases but evidently not always.

But if the buffer moves the hash array containing _pointers_ to hash

Yeah... I might just do a separate allocation for hash entries as well.

Or maybe even drop the hash chaining altogether (now that the code does
backward matching that might be worth the code simplification).

Nicolas
-

To: Nicolas Pitre <nico@...>
Cc: Junio C Hamano <junkio@...>, Alex Riesen <raa.lkml@...>, Randal L. Schwartz <merlyn@...>, <git@...>
Date: Wednesday, May 10, 2006 - 3:01 pm

Btw, Nico, that rabin hash code is _extremely_ confusing.

The hash entry pointers point to "data + RABIN_WINDOW", and then to make
things even _more_ confusing, the hash calculation code is actually offset
by one, so it will have computed the hash with

val = ((val << 8) | data[i]) ^ T[val >> RABIN_SHIFT];

where "i" goes from _1_ to RABIN_WINDOW instead of 0..WINDOW-1.

So, if I read that correctly, the "entry->ptr" actually points not to the
beginning of the data that was hashed, or even the end, but literally to
the last byte of the data that was hashed in that window.

Isn't that just _really_ confusing?

Or is there some sense to this?

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Alex Riesen <raa.lkml@...>, Randal L. Schwartz <merlyn@...>, <git@...>
Date: Wednesday, May 10, 2006 - 3:43 pm

Yes, ptr points to the last byte of the window for given hash value.

This is so because in the matching loop the window is scrolled byte by
byte and the new hash value is always made of ROBIN_WINDOW-1 bytes which
already have been processed (most probably added as literal bytes in the
delta buffer) plus one new byte. So it makes sense to start forward
byte matching only from that last byte to find the longest source area
to match, especially since all the other bytes in the window are likely
to be identical already and comparing them repeatedly for entries with
the same hash would be wasteful in most cases.

Further down, once the best offset with the longest match in the source
buffer has been found, backward matching is performed to remove as much
literal bytes that were added to the delta output as possible, which is
most likely to catch the whole Robin window that hasn't been compared
previously, but in that case the window content is compared only once.

And the reason why the reference hash is computed with an offset of 1 to
RABIN_WINDOW inclusive in create_delta_index() is to allow for quick
initialization of the Rabin window _outside_ of the main loop in
create_delta(). There is a comment to that effect near the top of
create_delta_index but probably a small reminder should be added in the
index loop as well.

Nicolas
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Alex Riesen <raa.lkml@...>, Randal L. Schwartz <merlyn@...>, <git@...>
Date: Wednesday, May 10, 2006 - 3:57 pm

And of course s/robin/rabin/ (I can't type RABIN without having my
fingers decide on ROBIN by themselves).

Nicolas
-

To: Nicolas Pitre <nico@...>
Cc: Junio C Hamano <junkio@...>, Alex Riesen <raa.lkml@...>, Randal L. Schwartz <merlyn@...>, <git@...>
Date: Wednesday, May 10, 2006 - 1:18 pm

Actually, no, you got that part right. Mea culpa. It really only ended up
being a problem when the area was moved, since the pointers into that area
weren't updated.

The alignment issue is real, but looking at the different structures,
they'll all have pointers that end up being the real (and only) alignment
constraint, so as a result, on any reasonable machine they all have the
same alignment and things are fine.

Junio - pls apply Nico's patch asap. It's correct, and fixes a really
nasty problem. And I'll withdraw my other worries as "not consequential".

Linus
-

Previous thread: Re: Implementing branch attributes in git config by Jakub Narebski on Wednesday, May 10, 2006 - 12:03 pm. (1 message)

Next thread: [Patch] git-cvsimport: tiny fix by Elrond on Wednesday, May 10, 2006 - 1:37 pm. (4 messages)