Recent press[1] is talking about sha-1 collisions again. Even though the reported attack was against a weakened variant of sha-1 (64 rounds, not the full 80), it serves as a useful point to start talking about the future.

I argue that sha-256 is better suited to git's purposes, and to modern machines, than sha-1.

Upsides to sha-256:
* not just a bit increase, but a stronger algorithm. there is more mixing, doing a more-than-incrementally better job at avoiding collisions.
* the bit increase itself provides more hash space, theoretically reducing collisions.
* properly aligned, a set of 32-byte hashes won't straddle CPU cachelines.

Downsides to sha-256:
* git protocol/storage format change implications.
* increase in storage size (20 to 32 bytes per hash).
* fewer hand-optimized algorithm variants have been implemented.
* likely more CPU cycles per hash, though I haven't measured.

Wikipedia page has lotsa info: http://en.wikipedia.org/wiki/Secure_Hash_Algorithm

Maybe sha-256 could be considered for the next major-rev of git?

Jeff

[1] http://www.heise-security.co.uk/news/77244
-
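The size and alignment points in the two lists can be checked with a small sketch. It assumes a 64-byte cacheline and a densely packed array of raw digests starting on a cacheline boundary; neither assumption comes from the message, and the helper name `straddlers` is made up for illustration.

```python
import hashlib

CACHELINE = 64  # assumed cacheline size in bytes; not stated in the thread


def straddlers(hash_size, count, cacheline=CACHELINE):
    """Indices of hashes in a packed, cacheline-aligned array that cross a boundary."""
    result = []
    for i in range(count):
        start = i * hash_size
        end = start + hash_size - 1  # offset of the last byte of hash i
        if start // cacheline != end // cacheline:
            result.append(i)
    return result


sha1_size = hashlib.sha1().digest_size      # 20 bytes
sha256_size = hashlib.sha256().digest_size  # 32 bytes

print("sha-1 digest bytes:", sha1_size)
print("sha-256 digest bytes:", sha256_size)
print("20-byte hashes crossing a cacheline:", straddlers(sha1_size, 16))
print("32-byte hashes crossing a cacheline:", straddlers(sha256_size, 16))
```

Under these assumptions, roughly one in three 20-byte hashes crosses a boundary, while 32-byte hashes pack exactly two to a line. Whether that matters in practice depends on how the hashes are actually laid out in memory.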
Not sure, but _if_ we want it we should do it sooner rather than later. -- Krzysztof Halasa -
Modifying git-convert-objects.c to rewrite the regular sha1 into a sha256 should be fairly straightforward. It's never been used since the early days (and has limits like a maximum of a million objects etc that can need fixing), but it shouldn't be "fundamentally hard" per se. Linus -
Sure. I was rather thinking of rapidly increasing number of git repositories, each with growing history. -- Krzysztof Halasa -
Hi,

But what about signed tags? (This issue has come up before, but has never been addressed.)

I also thought about supporting hybrid hashes, i.e. that older objects still can be hashed with SHA-1. Alas, a simple thought experiment demonstrates how silly that idea is: most of the objects will not change between two revisions, and they'd have to be rehashed with SHA-256 (or whatever we decide upon) anyway, so hybrids would do no good.

A better idea would be to increment the repository version, and expect SHA-1 for version 1, SHA-256 for version >= 2.

However, I could imagine that we do not need this huge change (it would break _many_ setups). The breakthrough was announced last Tuesday, and it involved 75% payload, i.e. to fake a new -- say -- git.c, one would need to enlarge git.c by a factor of 4, and you would see a lot of gibberish inside some comment. (Note that I did not listen to the talk myself, this is all deduced from the scarce information which is available via the 'net.)

Even if the breakthrough really comes to full SHA-1, you still have to add _at least_ 20 bytes of gibberish. Which would be harder to spot, but it would be spotted.

This made me think about the use of hashes in git. Why do we need a hash here (in no particular order): 1) integrity checking, 2) fast lookup, 3) identifying objects (related to (2)), 4) trust.

Except for (4), I do not see why SHA-1 -- even if broken -- should not be adequate. It is not like somebody found out that all JPGs tend to have similar hashes so that collisions are more likely.

And thinking about trust: The hash is augmented by thinking persons. It is not like you blindly trust a person forever. You build up trust, and once somebody has failed you, the trust is lost, and very hard to build up again. So, you just would try to get all objects again from somebody you still trust, and never pull from the loser^H^H^H^H^Huntrusted person again. Ever.

Besides, as has been pointed out several times, a ...
Signed tags fundamentally have to be re-signed. That's by design: if somebody could rewrite an archive and signed tags would still be accepted to have the right signature, that would be a _serious_ sign of a totally broken security model.

Indeed. Hybrids would not only do no good, but they would actually _actively_ hurt things, because they'd fundamentally break the notion that the hash being identical means that the object (blob, tree, subtree) is the same. So allowing two names for the same object is very fundamentally wrong in

Yes. It would be reasonably painful for users, though (as Krzysztof correctly points out). Every client would have to convert when a

Yeah, I don't think this is at all critical, especially since git really on a security level doesn't _depend_ on the hashes being cryptographically secure.

As I explained early on (ie over a year ago, back when the whole design of git was being discussed), the _security_ of git actually depends not on cryptographic hashes, but simply on everybody being able to secure their own _private_ repository.

So the only thing git really _requires_ is a hash that is _unique_ for the developer (and there we are talking not of an _attacker_, but a benign participant).

That said, the cryptographic security of SHA-1 is obviously a real bonus. So I'd be disappointed if SHA-1 can be broken more easily (and I obviously already argued against using MD5, exactly because generating duplicates of that is fairly easy). But it's not "fundamentally required" in git per se.

[ The one exception: the "signed tags" security does depend on the hashes being cryptographically strong. So again, breaking SHA-1 would not mean that git stops working, but it _would_ potentially mean that if you don't trust your own _private_ repository, the signed tag may no longer

Correct. I'm pretty sure we had exactly this discussion around May 2005, but I'm too lazy to search ;)

Linus -
--On Sunday, August 27, 2006 03:35:20 PM -0700 Linus Torvalds

Just to double check: if you already have a file A in git with hash X, is there any condition where a remote file with hash X (but different contents) would overwrite the local version?

What would happen if you ended up with two packs that both contained a file with hash X but with different contents, and then did a repack on them? (Either packs from different sources, or packs downloaded through some mechanism other than the git protocol, are two ways this could happen that I can think of.)

David Lang -
Nope. If it has the same SHA1, it means that when we receive the object from the other end, we will _not_ overwrite the object we already have. So what happens is that if we ever see a collision, the "earlier" object in any particular repository will always end up overriding. But note that "earlier" is obviously per-repository, in the sense that the git object network generates a DAG that is not fully ordered, so while different repositories will agree about what is "earlier" in the case of direct ancestry, if the object came through separate and not directly related branches, two different repos may obviously have gotten the two objects in different order. However, the "earlier will override" is very much what you want from a security standpoint: remember that the git model is that you should primarily trust only your _own_ repository. So if you do a "git pull", the new incoming objects are by definition less trustworthy than the objects you already have, and as such it would be wrong to allow a new object to replace an old one. So you have two cases of collision: - the inadvertent kind, where you somehow are very very unlucky, and two files end up having the same SHA1. At that point, what happens is that when you commit that file (or do a "git-update-index" to move it into the index, but not committed yet), the SHA1 of the new contents will be computed, but since it matches an old object, a new object won't be created, and the commit-or-index ends up pointing to the _old_ object. You won't notice immediately (since the index will match the old object SHA1, and that means that something like "git diff" will use the checked-out copy), but if you ever do a tree-level diff (or you do a clone or pull, or force a checkout) you'll suddenly notice that that file has changed to something _completely_ different than what you expected. So you would generally notice this kind of collision fairly quickly. In ...
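The "earlier object wins" behaviour described above can be shown with a toy model. This is only a sketch: `ToyObjectStore` and `weak_hash` are hypothetical names, the store is an in-memory dict rather than git's real object database, and the hash is deliberately truncated to one hex digit so that a collision is easy to find.

```python
import hashlib


def weak_hash(content: bytes) -> str:
    """Deliberately weak hash (first hex digit of SHA-1) so collisions are easy to find."""
    return hashlib.sha1(content).hexdigest()[:1]


class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object database; not git's real code."""

    def __init__(self, hash_fn):
        self._hash_fn = hash_fn
        self._objects = {}

    def add(self, content: bytes) -> str:
        name = self._hash_fn(content)
        # An object that already exists under this name is never overwritten.
        self._objects.setdefault(name, content)
        return name

    def get(self, name: str) -> bytes:
        return self._objects[name]


store = ToyObjectStore(weak_hash)
old_name = store.add(b"local, trusted content")

# Search for different content that collides with the local object's name.
incoming = next(
    b"incoming %d" % i
    for i in range(1000)
    if weak_hash(b"incoming %d" % i) == old_name
)
new_name = store.add(incoming)

print("same object name:", new_name == old_name)
print("stored content:", store.get(old_name))
```

The incoming colliding content is silently dropped and the store keeps returning the local object, which is the property argued for above: data already in your own repository is never replaced by data that arrives later.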
Btw, this is obviously only true for the native git protocol itself. If the attacker can fool you into generating the new file _yourself_, he can cause your checked-out copy to not match the git object database any more. In other words, one "interesting" attack vector is to feed you the colliding SHA1 not through a git-to-git transfer, but by generating a _patch_ that when applied will generate the collision, so that when you then commit that patch, you get something else than you expected. And _this_ is where it's important that the hash that git uses be a non-trivial one - ie we don't want people to be able to generate two files that look superficially "ok". So here's the rule: If you ever get a patch that looks like line-noise, especially from somebody you don't trust, DON'T APPLY IT! Now, that is obviously something you should never do _regardless_ of any git issues, so I don't think this is really a problem either. If you apply patches from people you don't have a good reason to trust without sanity-checking them, you deserve whatever you get, and quite frankly, a SHA1 hash collision is the _least_ of your problems ;) (This ends up boiling down to one common issue: it's generally _much_ easier to attack a project through _other_ means than through a hash collision. And I pretty much guarantee that that is the case even if we were to use a much weaker hash, like MD5. Hash collisions fundamentally just aren't good attack vectors, and it's a hell of a lot easier to try to insert bad code by other means) Linus -
This concept breaks down somewhat if you are pulling from two repositories (one good and one evil). If I pull from the evil repo first, that will become my "earlier" object, and I will never get the colliding object from the good repo. Executing such an attack might not be that hard, either (once we get over that little hump of creating collisions at will!). The owner of 'evil' has to know a SHA1 that will be in 'good' before it makes it to 'good'. However, I imagine we frequently see SHA1s migrate from more central repos (like .../torvalds/linux-2.6.git) to less central ones (subsystem / port maintainers, etc). -Peff -
Sure. But if you are pulling from an untrusted source, you'd better at least check the result. In fact, that's partly why "git pull" will do a diffstat after the pull. Exactly to force people to at least be minimally aware of what they pulled. And "gitk ORIG_HEAD.." is a great thing to always run when you pull from somebody you don't know and trust really well. Of course, that all was done mostly not because I don't "trust" the people I work with, but more because I didn't always trust that they'd do the right thing with git (ie they'd screw up the repo not because they were evil, but because they made a mistake). So even if you pull from an "evil" repo first, and you somehow get a "bad" object, the point is, the bad object _should_ be the one that overrides. Why? Because once you find out that the evil repo was bad (which you'll eventually find simply because it caused some bug - if the evil repo only helps you, it's obviously not evil at all), what you need to do is reset to _before_ the evil repo happened, do a "git repack -a -d" and finally a "git prune" to clean out all the bad cruft, and then pull the good repo without pulling the bad one first. After that, you apologize to everybody for screwing up and pulling from somebody you didn't trust, and then ask them to re-clone (or give them the appropriate "git reset" + "git repack -a" + "git prune" + "git pull" sequence so that they can fix their existing repos). The point being, a hash attack is really no worse than an attack that fools you into applying a really bad diff (regardless of SCM), and it's a hell of a lot harder to do. Both a hash attack and a diff attack mean that the person merging data should either trust his source or inspect the end result. Anybody who just blindly accepts data from untrusted sources is screwed in so many other ways that the hash attack simply isn't even on the radar. Linus -
I completely agree; however, even discussing "earlier takes precedence" entails that you are somehow pulling from an untrusted source. I just wanted to point out that "earlier" does not always mean "more trusted than the thing you're pulling now" (since it might have just been pulled Agreed. -Peff -
Actually I think we may see it when somebody tries to put a real example of a conflicting SHA-1 pair into a git repository. -- Krzysztof Halasa -
Well, by definition, I wouldn't call that "inadvertent" ;) Anyway, the way to do it (if you want to use git to document SHA1 hash mismatches) is to just check the files that have an identical SHA1 in. It will magically work! Why? Because a git SHA1 is actually _not_ the SHA1 of the file itself, it's the SHA1 of the file _with_the_git_header_added_. So if you find two files that have the same SHA1, they would also have to have the same length in order to actually generate the same object name. If they have different lengths, you can just check them into git, and they'll get two different git SHA1 names and you'll have a cool git archive where, when you check the files out, the checked-out files will share the same SHA1 ;) Linus -
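For reference, a minimal sketch of how a blob's object name is derived: the header is the object type, a space, the content length in decimal, and a NUL byte, and the SHA1 is taken over header plus content. The function name `git_blob_sha1` is made up here; the result should match what "git hash-object" prints for the same bytes.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Object name of a blob: SHA-1 over 'blob <length>\\0' followed by the content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


empty = b""
print("plain SHA1 of empty file:", hashlib.sha1(empty).hexdigest())
print("git blob name of empty file:", git_blob_sha1(empty))
print("git blob name of 'hello\\n':", git_blob_sha1(b"hello\n"))
```

Because the length is part of the hashed header, two files of different lengths are hashed with different prefixes, so a collision between their plain SHA1s does not carry over to their object names.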
Well, conflicting files will most probably have the same size, like with MD5 cases :-) -- Krzysztof Halasa -
That's only true for the much easier injection case where you generate _both_ files together. From an external git hash-attack standpoint, that's not a very useful case. It's much more useful if you can make a new file that has a hash that matches a given old file, and in that case, the file lengths are likely not the same. Linus -
Hi, The only notable exception I can think of: "git fetch -k". If you then try to retrieve the bogus object, it will return the one from whichever pack was returned first by readdir(). (If I read the source correctly.) Now, the cases are rare where you do both "git fetch -k" and "git repack -a -d" (the latter of which _could_ leave a hole in the directory which _could_ make the next fetched pack fill that hole, which in turn _could_ make readdir() return that pack before more "senior" packs) in the same repository, but in these cases, yes, you could end up with the copy of the remote side. You'd need to explicitly use "git fetch -k", though. Ciao, Dscho -
Good point. I didn't even think of "-k", since I mentally put that in the "initial clone usage only" category, but yeah, if people use it for incremental updates too, that could indeed cause ambiguity in which object to use when the other end does something bad. Linus -
According to a quick test using "openssl speed", it's a factor of two to four, depending on the input size (the difference is less

And in 2008, you'd have to rewrite history again, to use the next "stronger" hash function? Do you think that's really necessary or desirable?

Most users will have good control over what data enters their repositories, so they can spot the evil twins thanks to their high-entropy contents. Obviously, a second preimage attack would matter, but even for MD5, we aren't close to that one AFAIK. -
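A rough way to repeat the speed comparison without "openssl speed" is to time Python's hashlib. This is only a sketch: the helper name `seconds_per_hash`, the input sizes and the repeat count are arbitrary choices, and the ratio will vary with the CPU and the library build, so no particular result should be expected.

```python
import hashlib
import timeit


def seconds_per_hash(algo: str, size: int, repeats: int = 200) -> float:
    """Average wall-clock seconds to hash `size` zero bytes with `algo`."""
    data = b"\0" * size
    total = timeit.timeit(lambda: hashlib.new(algo, data).digest(), number=repeats)
    return total / repeats


for size in (16, 1024, 8192):
    t_sha1 = seconds_per_hash("sha1", size)
    t_sha256 = seconds_per_hash("sha256", size)
    print(f"{size:5d} bytes: sha256/sha1 time ratio = {t_sha256 / t_sha1:.2f}")
```

For tiny inputs the per-call overhead of the interpreter dominates, so the larger input sizes give the more meaningful comparison.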
