I'm still a bit under the weather and do not have enough concentration to dig into the problem tonight, but I noticed that something in "next", most likely the delta-base-offset patchset, broke git-index-pack: $ X=ec0c3491753e115e1775256f6b7bd1bce4dea7cd $ wget http://www.kernel.org/pub/scm/git/git.git/objects/pack/pack-$X.pack $ ~/git-master/bin/git-index-pack pack-$X.pack ec0c3491753e115e1775256f6b7bd1bce4dea7cd $ git-index-pack pack-$X.pack fatal: packfile 'pack-ec0c3491753e115e1775256f6b7bd1bce4dea7cd.pack' has unresolved deltas -
Using the tip of the "next" branch (git version 1.4.2.4.gf9fe) I just cannot reproduce this problem at all. I always get a good index and ec0c3491753e115e1775256f6b7bd1bce4dea7cd back. Nicolas -
Hmph. I just got exactly the same breakage; could this be another 64-bit breakage? My breakage was on x86-64. -
I've been suspecting that since then as well. I indeed tested on i386. But reviewing the code I just can't find any obvious spot where 64-bit would be an issue, especially since your pack does not have any OFS_DELTA objects. Could you instrument the code at the end of index-pack.c:parse_pack_objects() to display how many deltas were actually resolved and how many were not? IOW is it a case of all or nothing, or is there an isolated case of corruption lurking somewhere? Nicolas -
fatal: packfile 'pack-ec0c3491753e115e1775256f6b7bd1bce4dea7cd.pack' has 18915 unresolved ref-deltas and 0 ofs-deltas among 21205 By the way, "Gaaaah". Is this find_delta() called from find_delta_children() doing the right thing? I wonder if this is open to accidental collisions?. If you have an object name whose last 12-bytes are all NUL and you have a pack offset whose bytes happens to be a good prefix for an object, what happens? -
Hmmm.... Interesting. Is it possible that sizeof(union delta_base) might not be equal to 20 It is filtered out later thusly: ... for (j = ref_first; j <= ref_last; j++) if (deltas[j].obj->type == OBJ_REF_DELTA) resolve_delta(&deltas[j], data, ...); So if a collision happens the object won't be of the right type and it is simply skipped. Nicolas -
Yes, on x86_64 this is 24 because of 8-byte alignment for longs:
$ echo 'unsigned x =3D sizeof(union{unsigned char sha1[20]; long offset;});=
' | gcc -S -x c -m64 -o - - | grep long
.long 24
$ echo 'unsigned x =3D sizeof(union{unsigned char sha1[20]; long offset;});=
' | gcc -S -x c -m32 -o - - | grep long
.long 20
Enough eyeballs made this bug shallow ;-) Thanks.
diff --git a/index-pack.c b/index-pack.c
index fffddd2..49b6efe 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -166,6 +166,7 @@ static void *unpack_raw_entry(unsigned l
case OBJ_REF_DELTA:
if (pos + 20 >= pack_limit)
bad_object(offset, "object extends past end of pack");
+ memset(delta_base, 0, sizeof(*delta_base));
hashcpy(delta_base->sha1, pack_base + pos);
pos += 20;
break;
@@ -290,6 +291,7 @@ static void resolve_delta(struct delta_e
bad_object(obj->offset, "failed to apply delta");
sha1_object(result, result_size, type, obj->sha1);
+ memset(&delta_base, 0, sizeof(delta_base));
hashcpy(delta_base.sha1, obj->sha1);
if (!find_delta_childs(&delta_base, &first, &last)) {
for (j = first; j <= last; j++)
@@ -365,6 +367,7 @@ static void parse_pack_objects(void)
if (obj->type == OBJ_REF_DELTA || obj->type == OBJ_OFS_DELTA)
continue;
+ memset(&base, 0, sizeof(base));
hashcpy(base.sha1, obj->sha1);
ref = !find_delta_childs(&base, &ref_first, &ref_last);
memset(&base, 0, sizeof(base));
-
Yeah.... but this is a bit wasteful since (especially on 32-bit archs) you just write to the same memory twice in a row. Nicolas -
Ah bummer. Then this is most likely the cause. And here's a simple
fix (Junio please confirm):
diff --git a/index-pack.c b/index-pack.c
index fffddd2..56c590e 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -23,6 +23,12 @@ union delta_base {
unsigned long offset;
};
+/*
+ * Even if sizeof(union delta_base) == 24 on 64-bit archs, we really want
+ * to memcmp() only the first 20 bytes.
+ */
+#define UNION_BASE_SZ 20
+
struct delta_entry
{
struct object_entry *obj;
@@ -211,7 +217,7 @@ static int find_delta(const union delta_
struct delta_entry *delta = &deltas[next];
int cmp;
- cmp = memcmp(base, &delta->base, sizeof(*base));
+ cmp = memcmp(base, &delta->base, UNION_BASE_SZ);
if (!cmp)
return next;
if (cmp < 0) {
@@ -232,9 +238,9 @@ static int find_delta_childs(const union
if (first < 0)
return -1;
- while (first > 0 && !memcmp(&deltas[first - 1].base, base, sizeof(*base)))
+ while (first > 0 && !memcmp(&deltas[first - 1].base, base, UNION_BASE_SZ))
--first;
- while (last < end && !memcmp(&deltas[last + 1].base, base, sizeof(*base)))
+ while (last < end && !memcmp(&deltas[last + 1].base, base, UNION_BASE_SZ))
++last;
*first_index = first;
*last_index = last;
@@ -312,7 +318,7 @@ static int compare_delta_entry(const voi
{
const struct delta_entry *delta_a = a;
const struct delta_entry *delta_b = b;
- return memcmp(&delta_a->base, &delta_b->base, sizeof(union delta_base));
+ return memcmp(&delta_a->base, &delta_b->base, UNION_BASE_SZ);
}
static void parse_pack_objects(void)
Nicolas
-
Why do you use "unsigned long" in the first place? For some structure like this, it sounds positively wrong. Pack-files should be architecture-neutral, which means that they shouldn't depend on word-size, and they should be in some neutral byte-order. Quite frankly, this all makes me go "Eww..". The original pack-file (well, v2) format was well-defined and had none of these issues. In contrast, the new code in 'next' is just _ugly_. And the thing about "ugly" is that it also tends to mean "fragile" and "buggy" and "hard to extend later". And maybe it's just me, but I consider unions to be bug-prone on their own. The "master" branch has exactly two unions: the "grep_expr" structure contains one (where the union member is clearly defined by the node type in that structure), and object.c has a "union any_object" that _literally_ exists as purely an allocation size issue (ie it is used _only_ to allocate the maximum size of any of the possible structures). In contrast, the new union introduced in "next" is just horrid. There's not even any way to know which member to use, except apparently that it expects that a SHA1 is never zero in the last 12 bytes. Which is probably true, but still - that's some ugly stuff. Is this something you want to bet a big project on? Linus -
Because offsets into packs are expressed as unsigned long everywhere
But they do. Please consider this code:
case OBJ_OFS_DELTA:
memset(delta_base, 0, sizeof(*delta_base));
c = pack_base[pos++];
base_offset = c & 127;
while (c & 128) {
base_offset += 1;
if (!base_offset || base_offset & ~(~0UL >> 7))
bad_object(offset, "offset value overflow for delta base object");
if (pos >= pack_limit)
bad_object(offset, "object extends past end of pack");
c = pack_base[pos++];
base_offset = (base_offset << 7) + (c & 127);
}
delta_base->offset = offset - base_offset;
if (delta_base->offset >= offset)
bad_object(offset, "delta base offset is out of bound");
break;
Do you see anything inerently wrong in this code? The above is already
64-bit ready such that it'll just work on 64-bit archs and will display
a sensible message if a 32-bit arch encounter a pack larger than 4GB.
This union should be looked at just like a sortable hash pointing to a
base object so that deltas with the same base object can be sorted
together. And the field to use is well defined of course: deltas with
sha1 to base use the sha1 member, deltas with offset to base use the
offset member. This hash, together with the delta type, constitute a
I don't see why not.
Nicolas
-
Until your work, that "unsigned long" was totally just an internal thing Right. The pack-file itself. But the code that actually _generates_ it .. and it sorts _differently_ on a big-endian vs little-endian thing, doesn't it? So now the sort order depends on endianness and/or wordsize. That just sounds really really wrong. Linus -
And would you please explain how my work changes that state of affairs? Sure. But who cares? The sorting is just there to 1) perform binary searches on the list of deltas based from a given object, and 2) find a Again, who cares? That ordering doesn't influence any data produced by the tool. It is an internal and private strategy to speed up the _local_ _searching_ process. It could be replaced by a dumb linear list walk if you wish and the end result i.e. the produced pack index would be exactly the same (with a significant slowdown notwitstanding). So let me summarize: - the union is a hash. - the hash is either an offset value or a sha1 digest. - this hash is used for fast object lookup _only_. - it does sort differently on big vs little endian machines. - but we don't care at all because - it is a private algorithmic thing that doesn't "bleed" into any _real_ data structure, and - it doesn't have any influence on the format of the end result. - it is only a runtime abstraction and nothing else. - It never gets into the pack nor the pack index themselves. Do you still have issues with that? Nicolas -
_I_ care. The new code is messy. It's fragile, and already showed one very fundamental bug which depended on architectures. These things matter. We have had very few bugs in git, and one of the reasons is (I believe) that we haven't had ad-hoc code. I get _very_ nervous when you mix up SHA1 names with somethign totally different without even a flag to say which one it is. That's just nasty. The fact that the code then behaves (and behave_d_) differently on different architectures is just a sign of the problem. "Who cares?" is not a good question to ask for a SCM. Linus -
My stance is that it is not fragile. Sure it had one bug that depended on an architecture difference, but so was commit ac58c7b18e about. The code also has many consistency checks all over so that it doesn't write out garbage if such bugs arise. And that fundamental bug was actually a trivial one that was caught right away by such consistency But there _is_ a flag for damn sake. Did you at least try to understand the code and not just skim over it from 10000 feet above? It is really simple: - if the found union content matches with a reference union initialized through the sha1 member then deltas[j].obj->type == OBJ_REF_DELTA must be true. - if the found union content matches with a reference union initialized through the sha1 member then deltas[j].obj->type == OBJ_OFS_DELTA must be true. - For all deltas with deltas[j].obj->type == OBJ_REF_DELTA there can not be more than one of them with the same union value. - For all deltas with deltas[j].obj->type == OBJ_OFS_DELTA there can not be more than one of them with the same union value. Does this mean that, with your own change to xdiff that has just been committed, you actually created a "problem"? Because this is a change that creates different behaviors whether a 32-bit or 64-bit architecture is used, Right? But of course not. We want it to behave differently on 64-bit than 32-bit. My code is in the _same_ camp since I explicitly want it to sort numbers differently whether it is a little endian or big endian machine. Please just try to understand why I'm claming this is not important in this very case. Please do me this favor. Nicolas -
I only looked at it from the patches, and the actual data structure, If you go back to that discussion, I actually pointed out several times that the whole bug _was_ actually introduced exactly because the xdiff code used things that behave differently depending on word-size. My suggestion for a _proper_ fix was to not use "unsigned long" for that, and the patch I suggested (and eventually got merged) was to use the _low_ bits of the hash, exactly because the low bits are the ones that act the No, we actually don't. Not for xdiff, at least. The last thing you want is for different architectures to get different results. It's horrible. It means that bugs are hard to reproduce, and it means that even code that is "tested" is actually tested only for a particular architecture. So the bug in xdiff was _exactly_ that somebody - totally incorrectly - Maybe the code is fine. Maybe the particular detail wasn't important. But the original code didn't have _any_ dependencies on things like structure alignment that caused it to do strange things. And dammit, the fact is, I think the new format is just worse. I think it was a good thing to have the full SHA1 in the pack-file. I think the code got less understandable, and had more special cases, just because now we have two totally different kinds of deltas. So maybe I'm reacting to the fact that I think the bug happened in the first place for a very simple reason: the data structure wasn't unambiguous any more. Linus -
Ehm, I think there's a little bit of confusion. The incorrect golden ratio prime selection for 64 bits machines was coalescing hash indexes into a very limited number of buckets, hence creating very bad performance on diff operations. The result of the diff would have been exacly the same, just Haha, not exactly. I had just forgot about the incoming 64 bits world at the time. - Davide -
But my point is, you would have been better off _without_ an algorithm that cared about the word-size at all, or with just using "uint32_t". See? Yes, a "unsigned long" has more bits for hashing on a 64-bit architecture. But that's totally the wrong way of thinking about it. YOU DO NOT WANT MORE BITS! You want the same damn answer regardless of architecture! A diff algorithm that gives different answers on a 32-bit LE architecture than on a 64-bit BE architecture is BROKEN. If I run on x86-64, I want the same answers I got on x86-32, and the same ones I get on ppc32. Anything else is SIMPLY NOT ACCEPTABLE! So the whole idea that you should have used 64-bit values was broken, broken, broken. You should never have had anything that cared, because anything that cares is by definition buggy. This is why we should use the _low_ bits. Never the high bits. Linus -
Yes. At the time I picked Knuth's hash function because it was simple and fast enough. But it needed special handling for different word sizes, Speaking in general, seen at the hash function level, of course an interface should not give different result for different word sizes or word endianess. Considering the diff algorithm as interface, as I said, the output was unaffected by the 64 bits word size. It was just very slow. - Davide -
Well, even the output may actually be affected, in the case of _real_ hash collisions (as opposed to just the hash _list_ collision that XDL_HASHLONG caused). So I actually think it would be better to have "uint32_t" as the hash value - because that would mean that all diffs (or, in the case of the block-algorithm, the deltas) are guaranteed to give the same results regardless of architecture. Right now, we actually generate a 64-bit hash value (BUT: for short lines, it's likely only _interesting_ in the low bits, so the high bits tend to have a very high likelihood of being zero). So hash collisions are different: on a 32-bit architecture, two lines may have the same hash, while on a 64-bit one, they are different. And together with some of the limiters we have (eg XDL_MAX_EQLIMIT) hash collisions can sometimes affect the output. Admittedly, in _practice_ this is really unlikely to affect anything (you'd get a valid diff in either case, they'd just possibly be subtly different, and the input data must be _really_ strange to even see that case), but I do think that the hash algorithm can matter. NOTE! I'm not talking about XDL_HASHLONG(), I'm talking about the xdl_hash_record() hash, which returns differently-sized hash results on 32-bit and 64-bit. And there are cases where we _only_ compare the hashes, and don't actually double-check the contents. So I think that in _practice_ you can't see differences between a 32-bit version and a 64-bit one, but the possibility is there. Using "uint32_t" instead of "unsigned long" to keep track of hashes would avoid that theoretical problem (and might actually make for better performance on 64-bit archtiectures, if only because of denser data structures and thus better cache behaviour). Linus -
The hash value (hence the hash bucket index) simply directs you to the bucket where a real record-compare loop is performed. Collisions here means only performance loss (you don't spread buckets enough), nothing more. So the different behaviour does not apply to the diff algo. But, it would apply if the knowledge of the hash indexing would be "exported", for example inside an external file. Think about a trivial DB file where you store hash buckets on file an you want to lookup records based on the store hash layout. In that case, of course, if the hash function that generated the DB bucket layout is different from the one that you use to get the bucket index at lookup time, you're in trouble. IOW if the hash function result is not "exported" is some way, it doesn't really matter if it's an 'unsigned long' or a 'uint32_t'. In the same way you cannot export binary structures using 'int' or 'long', and expect to be compatible over different architectures. - Davide -
As far as I can tell not all loops do a real "record-compare" thing. Some of the hash loops _only_ look at the hash, and as such a bad hash will do more than just cause bad performance, it will actually degrade the diff itself. Isn't that what XDL_MAX_EQLIMIT effectively does? Btw, the binary delta generator doesn't seem to have this issue at all: it uses "unsigned int" for the hash values, so the xdiff delta generation will give the same exact results on 32-bit and 64-bit architectures. Or was that one of the changes by Nico? (I only looked at the git version of that code) Linus -
The XDL_MAX_EQLIMIT is used to limit the search for equal records, in the record-discard phase. Note though, that at that point that "ha" value is a record-class ID (every different record/line in the input has a unique ID). Look at what xdl_classify_record() does. So in that case, XDL_HASHLONG can really simply be a bitmask. So comparing "ha" in the loop in there, does actually the right thing in any case (equal "ha" means really equal The binary diff in libxdiff uses a chaining hash, so even in that case it wouldn't have made a difference. I think Nico changed the hash to be a coalesced hash, and in that case it does change the output. - Davide -
The part you pointed out to me about "accidental collision" still bothers me somewhat. Right now we do not produce ref-delta and ofs-delta in the same stream, but if somebody did so then it would mean a disaster to have an accidental collision of an 8-byte offset value plus 12-byte traiing NUL and another base object whose object name happens to match that pattern. I am actually Ok if we say the code assumes one stream has only ref-delta or ofs-delta and never both. But then I suspect the first pass of parse_pack_objects() should make sure that assumption holds true for the pack being inspected and barf if it is not. Also the second pass do not have to run two find_delta_childs() calls per delta object because by that time we know which kind would never appear in the packfile. By the way can we call that find_delta_children() pretty please? -
Not really. The only effect that would have on the sorted list of delta entries -- such sorting used to bring all deltas with the same base object contigously -- is that those deltas might not be perfectly contigous wrt their base object. This is why there is a test to skip True, but the flexibility is worth having I think. It makes the thing I have no problem with that. Nicolas -
Ah, I misread the code that uses union actually checks the type in struct delta_entry (which embeds the union). There won't be any collision problem and you support both types at the same time just fine. And your patch to compare only the first 20-bytes makes sense (assuming ulong is always shorter than 20-bytes which I think is safe to assume). -
Does this sound fair (the code is yours, just asking about the
log message)?
If we really wanted to be purist, we could run comparison with
the union and obj->type as two keys, but I do not think it is
worth it.
-- >8 --
From: Nicolas Pitre <nico@cam.org>
Date: Tue, 17 Oct 2006 16:23:26 -0400
Subject: [PATCH] index-pack: compare the first 20-bytes of the key.
The "union delta_base" is a strange beast. It is a 20-byte
binary blob key to search a binary searchable deltas[] array,
each element of which uses it to represent its base object with
either a full 20-byte SHA-1 or an offset in the pack. Which
representation is used is determined by another field of the
deltas[] array element, obj->type, so there is no room for
confusion, as long as we make sure we compare the keys for the
same type only with appropriate length. The code compared the
full union with memcmp().
When storing the in-pack offset, the union was first cleared
before storing an unsigned long, so comparison worked fine.
On 64-bit architectures, however, the union typically is 24-byte
long; the code did not clear the remaining 4-byte alignment
padding when storing a full 20-byte SHA-1 representation. Using
memcmp() to compare the whole union was wrong.
This fixes the comparison to look at the first 20-bytes of the
union, regardless of the architecture. As long as ulong is
smaller than 20-bytes this works fine.
Signed-off-by: Junio C Hamano <junkio@cox.net>
---
index-pack.c | 14 ++++++++++----
1 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/index-pack.c b/index-pack.c
index fffddd2..56c590e 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -23,6 +23,12 @@ union delta_base {
unsigned long offset;
};
+/*
+ * Even if sizeof(union delta_base) == 24 on 64-bit archs, we really want
+ * to memcmp() only the first 20 bytes.
+ */
+#define UNION_BASE_SZ 20
+
struct delta_entry
{
struct object_entry *obj;
@@ -211,7 +217,7 @@ static int find_delta(const union ...Hi, Excuse me for joining the game, but why don't you just use the recently introduced hashcmp() for that purpose? AFAIU you do exactly that, you compare hashes. Ciao, Dscho -
Yes, and that is what I did originally. But that could lead to false assumptions (and this thread already proved this code has its share of false assumption leads already). The thing is that the memory chunk that is being compared is not always the same kind of hash as usually used with hashcmp(). Throughout the code hashcmp() is always used with a 20-byte sha1 digest. In this case it can be either a 20-byte sha1 digest, or a long offset value. And by using hashcmp() I would be afraid someone else could assume the hash is always a sha1 digest which it is not. Nicolas -
Signed-off-by: Nicolas Pitre <nico@cam.org> -
This is an in-core structure if I am not mistaken, and is never written out in the resulting pack file. The program is reading from .pack and building .idx that corresponds to it. I agree that it is ugly, but I do not hink using neutral byte-order in this part of the processing buys us anything in this particular Again I agree with the ugliness objection, and I raised the "expecting no collision" issue which Nico refuted later. The code is probably safe (not just safe in practice but safe in theory as well), although that would not change the fact that it is ugly X-<. Nico, could we have an un-uglied version please? -
The problem is that I just don't see what you find "ugly" in the thing.
Maybe it is just a matter of few missing comments?
Please let me walk you through bits of the code. I hope you'll
understand better why it is like it is after that.
In parse_pack_objects() you can read this:
/*
* First pass:
* - find locations of all objects;
* - calculate SHA1 of all non-delta objects;
* - remember base SHA1 for all deltas.
*/
So we reads through the whole pack to find where objects are located.
This information is stored in the objects[] array for all objects.
Entries for that array are:
struct object_entry
{
unsigned long offset;
enum object_type type;
enum object_type real_type;
unsigned char sha1[20];
};
ON that first pass only the offset and type can be determined for all
objects. The sha1 can be determined for non delta objects only.
There is a second array, deltas[], to record information about deltas as
they are found:
struct delta_entry
{
struct object_entry *obj;
union delta_base base;
};
The "obj" member points to the entry in the objects[] array for given
delta entry. And the "base" member, as you may guess, is a reference to
the base object for that delta.
So here we are with that dreaded union.
Now why isn't it a second struct object_entry pointer instead? Because
one kind of delta allows for its base object to appear later in the pack
and all we've got for identifying that base object is a sha1 reference
since we cannot predict where the base object will be. With the other
object type we _could_ find out about the location of the base object
right away. But that would mean adding a special case for those objects
through a different code path unnecessarily. Instead, the base offset
is treated just like a very special sha1 value and the same array is
used. So maybe the comment that says "remember base SHA1 for all ...