Recent press[1] is talking about sha-1 collisions again. Even though
the reported attack was against a weakened variant of sha-1 (64, not 80,
passes), it serves as a useful point to start talking about the future.

I argue that sha-256 is better suited to git's purposes, and to modern
machines, than sha-1.

Upsides to sha-256:
* not just a bit increase, but a stronger algorithm. there is more
mixing, doing a more-than-incrementally better job at avoiding collisions.
* the bit increase itself provides more hash space, theoretically
reducing collisions.
* properly aligned, a set of 32-byte hashes won't straddle CPU cachelines.

Downsides to sha-256:
* git protocol/storage format change implications.
* increase in storage size (20 to 32 bytes per hash).
* fewer hand-optimized algorithm variants have been implemented.
* likely more CPU cycles per hash, though I haven't measured.

Wikipedia page has lotsa info:
http://en.wikipedia.org/wiki/Secure_Hash_Algorithm

Maybe sha-256 could be considered for the next major-rev of git?

Jeff
[1] http://www.heise-security.co.uk/news/77244
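
To make the size and alignment points above concrete, here is a minimal Python sketch. It assumes a 64-byte cacheline (common, but not stated in the post) and hashes packed back to back starting at a line-aligned offset. `count_straddlers` is a hypothetical helper for illustration, not anything in git.

```python
import hashlib

CACHELINE = 64  # assumption: 64-byte cachelines; not stated in the post


def count_straddlers(hash_size, count, line=CACHELINE):
    """Count hashes in a packed, line-aligned array that cross a cacheline boundary."""
    straddlers = 0
    for i in range(count):
        first = i * hash_size          # offset of the hash's first byte
        last = first + hash_size - 1   # offset of the hash's last byte
        if first // line != last // line:
            straddlers += 1
    return straddlers


sha1_size = hashlib.sha1().digest_size      # raw digest size in bytes
sha256_size = hashlib.sha256().digest_size  # raw digest size in bytes

print("sha-1:  ", sha1_size, "bytes,", count_straddlers(sha1_size, 16), "of 16 straddle")
print("sha-256:", sha256_size, "bytes,", count_straddlers(sha256_size, 16), "of 16 straddle")
```

The storage cost is the flip side: the same 16 hashes take 320 bytes as sha-1 and 512 bytes as sha-256.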
-
According to a quick test using "openssl speed", it's a factor of two
to four, depending on the input size (the difference is less

And in 2008, you'd have to rewrite history again, to use the next
"stronger" hash function? Do you think that's really necessary or
desirable? Most users will have good control over what data enters
their repositories, so they can spot the evil twins thanks to their
high-entropy contents. Obviously, a second preimage attack would
matter, but even for MD5, we aren't close to that one AFAIK.
-
Not sure, but _if_ we want it we should do it sooner rather than
later.
--
Krzysztof Halasa
-
Modifying git-convert-objects.c to rewrite the regular sha1 into a sha256
should be fairly straightforward. It's never been used since the early
days (and has limits like a maximum of a million objects etc that can need
fixing), but it shouldn't be "fundamentally hard" per se.

Linus
-
Hi,
But what about signed tags? (This issue has come up before, but never has
been addressed.)

I also thought about supporting hybrid hashes, i.e. that older objects
still can be hashed with SHA-1. Alas, a simple thought experiment
demonstrates how silly that idea is: most of the objects will not change
between two revisions, and they'd have to be rehashed with SHA-256 (or
whatever we decide upon) anyway, so hybrids would do no good.

A better idea would be to increment the repository version, and expect
SHA-1 for version 1, SHA-256 for version >= 2.

However, I could imagine that we do not need this huge change (it would
break _many_ setups). The breakthrough was announced last Tuesday, and it
involved 75% payload, i.e. to fake a new -- say -- git.c, one would need
to enlarge git.c by a factor 4, and you would see a lot of gibberish
inside some comment. (Note that I did not listen to the talk myself, this
is all deduced from the scarce information which is available via the
'net.)

Even if the breakthrough really comes to full SHA-1, you still have to add
_at least_ 20 bytes of gibberish. Which would be harder to spot, but it
would be spotted.

This made me think about the use of hashes in git. Why do we need a hash
here (in no particular order):

1) integrity checking,
2) fast lookup,
3) identifying objects (related to (2)),
4) trust.

Except for (4), I do not see why SHA-1 -- even if broken -- should not be
adequate. It is not like somebody found out that all JPGs tend to have
similar hashes so that collisions are more likely.

And thinking about trust: The hash is augmented by thinking persons. It is
not like you blindly trust a person forever. You build up trust, and once
you were failed, the trust is lost, and very hard to build up again. So,
you just would try to get all objects again from somebody you still trust,
and never pull from the loser^H^H^H^H^Huntrusted person again. Ever.

Besides, as has been pointed out several times, a di...
-
Signed tags fundamentally have to be re-signed. That's by design: if
somebody could rewrite an archive and signed tags would still be accepted
to have the right signature, that would be a _serious_ sign of a totally
broken security model.

Indeed. Hybrids would not only do no good, but they would actually
_actively_ hurt things, because they'd fundamentally break the notion that
the hash being identical means that the object (blob, tree, subtree) is
the same.

So allowing two names for the same object is very fundamentally wrong in

Yes. It would be reasonably painful for users, though (as Krzysztof
correctly points out). Every client would have to convert when a

Yeah, I don't think this is at all critical, especially since git really
on a security level doesn't _depend_ on the hashes being cryptographically
secure. As I explained early on (ie over a year ago, back when the whole
design of git was being discussed), the _security_ of git actually depends
not on cryptographic hashes, but simply on everybody being able to secure
their own _private_ repository.

So the only thing git really _requires_ is a hash that is _unique_ for the
developer (and there we are talking not of an _attacker_, but a benign
participant).

That said, the cryptographic security of SHA-1 is obviously a real bonus.
So I'd be disappointed if SHA-1 can be broken more easily (and I obviously
already argued against using MD5, exactly because generating duplicates of
that is fairly easy). But it's not "fundamentally required" in git per se.

[ The one exception: the "signed tags" security does depend on the hashes
being cryptographically strong. So again, breaking SHA-1 would not mean
that git stops working, but it _would_ potentially mean that if you
don't trust your own _private_ repository, the signed tag may no longer

Correct. I'm pretty sure we had exactly this discussion around May 2005,
but I'm too lazy to search ;)

Linus
-
--On Sunday, August 27, 2006 03:35:20 PM -0700 Linus Torvalds
just to double check.
if you already have a file A in git with hash X is there any condition
where a remote file with hash X (but different contents) would overwrite
the local version?

what would happen if you ended up with two packs that both contained a file
with hash X but with different contents and then did a repack on them?
(either packs from different sources, or packs downloaded through some
mechanism other than the git protocol are two ways this could happen that I
can think of)

David Lang
-
Nope. If it has the same SHA1, it means that when we receive the object
from the other end, we will _not_ overwrite the object we already have.

So what happens is that if we ever see a collision, the "earlier" object
in any particular repository will always end up overriding. But note that
"earlier" is obviously per-repository, in the sense that the git object
network generates a DAG that is not fully ordered, so while different
repositories will agree about what is "earlier" in the case of direct
ancestry, if the object came through separate and not directly related
branches, two different repos may obviously have gotten the two objects in
different order.

However, the "earlier will override" is very much what you want from a
security standpoint: remember that the git model is that you should
primarily trust only your _own_ repository. So if you do a "git pull", the
new incoming objects are by definition less trustworthy than the objects
you already have, and as such it would be wrong to allow a new object to
replace an old one.

So you have two cases of collision:
- the inadvertent kind, where you somehow are very very unlucky, and two
files end up having the same SHA1. At that point, what happens is that
when you commit that file (or do a "git-update-index" to move it into
the index, but not committed yet), the SHA1 of the new contents will be
computed, but since it matches an old object, a new object won't be
created, and the commit-or-index ends up pointing to the _old_ object.

You won't notice immediately (since the index will match the old object
SHA1, and that means that something like "git diff" will use the
checked-out copy), but if you ever do a tree-level diff (or you
do a clone or pull, or force a checkout) you'll suddenly notice that
that file has changed to something _completely_ different than what you
expected. So you would generally notice this kind of collision fairly
quickly.

In relat...
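
The "earlier object wins" behaviour described above can be modelled with a toy content-addressed store. This is only a sketch, not git's actual object-store code: `FirstWinsStore`, `weak_name` and `find_collision` are hypothetical names, and the object name is deliberately cut down to 8 bits of SHA-1 so that a collision can be found by brute force.

```python
import hashlib
import itertools


def weak_name(data):
    """Toy object name: only the first 8 bits of SHA-1, so collisions are easy."""
    return hashlib.sha1(data).hexdigest()[:2]


class FirstWinsStore:
    """Toy content-addressed store that never overwrites an existing object."""

    def __init__(self, name_fn):
        self._name_fn = name_fn
        self._objects = {}

    def add(self, data):
        name = self._name_fn(data)
        # setdefault keeps whatever was stored first under this name.
        self._objects.setdefault(name, data)
        return name

    def get(self, name):
        return self._objects[name]


def find_collision(name_fn):
    """Brute-force two different payloads that share the same toy name."""
    seen = {}
    for i in itertools.count():
        data = b"payload %d" % i
        name = name_fn(data)
        if name in seen:
            return seen[name], data
        seen[name] = data


earlier, later = find_collision(weak_name)
store = FirstWinsStore(weak_name)
name = store.add(earlier)        # the object already in the repository
same_name = store.add(later)     # the colliding object arriving later
print(name == same_name, store.get(name) == earlier)
```

Whichever payload reaches a given store first is the one it keeps, which is why two stores that receive the same pair in opposite order can end up disagreeing.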
-
Hi,
The only notable exception I can think of: "git fetch -k". If you then try
to retrieve the bogus object, it will return the one of whichever pack was
returned first by readdir(). (If I read the source correctly.)

Now, the cases are rare where you do both "git fetch -k" and "git repack
-a -d" (the latter of which _could_ leave a hole in the directory which
_could_ make the next fetched pack fill that hole, which in turn _could_
make readdir() return that pack before more "senior" packs) in the same
repository, but in these cases, yes, you could end up with the copy of the
remote side.

You'd need to explicitly use "git fetch -k", though.

Ciao,
Dscho
-
Good point.
I didn't even think of "-k", since I mentally put that in the "initial
clone usage only" category, but yeah, if people use it for incremental
updates too, that could indeed cause ambiguity in which object to use when
the other end does something bad.

Linus
-
Actually I think we may see it when somebody tries to put a real
example of conflicting SHA-1 pair into git repository.
--
Krzysztof Halasa
-
Well, by definition, I wouldn't call that "inadvertent" ;)
Anyway, the way to do it (if you want to use git to document SHA1 hash
mismatches) is to just check the files that have an identical SHA1 in. It
will magically work!

Why? Because a git SHA1 is actually _not_ the SHA1 of the file itself,
it's the SHA1 of the file _with_the_git_header_added_.

So if you find two files that have the same SHA1, they would also have to
have the same length in order to actually generate the same object name.
If they have different lengths, you can just check them into git, and
they'll get two different git SHA1 names and you'll have a cool git
archive that when you check the files out, the checked-out files will
share the same SHA1 ;)

Linus
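
A minimal Python sketch of the point above: a blob's git name is the SHA1 of a short header followed by the contents, not of the contents alone. The header layout used here ("blob", a space, the decimal length, a NUL byte) is how git names blobs. The function names `plain_sha1` and `git_blob_sha1` are made up for this example.

```python
import hashlib


def plain_sha1(data):
    """SHA1 of the raw file contents, as sha1sum would report it."""
    return hashlib.sha1(data).hexdigest()


def git_blob_sha1(data):
    """SHA1 of the contents with the git blob header prepended."""
    header = b"blob %d\0" % len(data)  # object type, length in decimal, NUL
    return hashlib.sha1(header + data).hexdigest()


contents = b"hello world\n"
print("plain:", plain_sha1(contents))
print("git:  ", git_blob_sha1(contents))
```

Because the length is part of the hashed header, two files of different lengths are hashed from different starting bytes, so a collision on the raw contents does not carry over to the git names.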
-
Well, conflicting files will most probably have the same size,
like with MD5 cases :-)
--
Krzysztof Halasa
-
That's only true for the much easier injection case where you generate
_both_ files together.

From an external git hash-attack standpoint, that's not a very useful
case. It's much more useful if you can make a new file that has a hash
that matches a given old file, and in that case, the file lengths are
likely not the same.

Linus
-
This concept breaks down somewhat if you are pulling from two
repositories (one good and one evil). If I pull from the evil repo
first, that will become my "earlier" object, and I will never get the
colliding object from the good repo.

Executing such an attack might not be that hard, either (once we get
over that little hump of creating collisions at will!). The owner of
'evil' has to know a SHA1 that will be in 'good' before it makes it to
'good'. However, I imagine we frequently see SHA1s migrate from more
central repos (like .../torvalds/linux-2.6.git) to less central ones
(subsystem / port maintainers, etc).

-Peff
-
Sure. But if you are pulling from an untrusted source, you'd better at
least check the result.

In fact, that's partly why "git pull" will do a diffstat after the pull.
Exactly to force people to at least be minimally aware of what they
pulled. And "gitk ORIG_HEAD.." is a great thing to always run when you
pull from somebody you don't know and trust really well.

Of course, that all was done mostly not because I don't "trust" the people
I work with, but more because I didn't always trust that they'd do the
right thing with git (ie they'd screw up the repo not because they were
evil, but because they made a mistake).

So even if you pull from an "evil" repo first, and you somehow get a "bad"
object, the point is, the bad object _should_ be the one that overrides.

Why? Because once you find out that the evil repo was bad (which you'll
eventually find simply because it caused some bug - if the evil repo only
helps you, it's obviously not evil at all), what you need to do is reset
to _before_ the evil repo happened, do a "git repack -a -d" and finally a
"git prune" to clean out all the bad cruft, and then pull the good repo
without pulling the bad one first.

After that, you apologize to everybody for screwing up and pulling from
somebody you didn't trust, and then ask them to re-clone (or give them the
appropriate "git reset" + "git repack -a" + "git prune" + "git pull"
sequence so that they can fix their existing repos).

The point being, a hash attack is really no worse than an attack that
fools you into applying a really bad diff (regardless of SCM), and it's a
hell of a lot harder to do. Both a hash attack and a diff attack mean that
the person merging data should either trust his source or inspect the end
result.

Anybody who just blindly accepts data from untrusted sources is screwed in
so many other ways that the hash attack simply isn't even on the radar.

Linus
-
I completely agree; however, even discussing "earlier takes precedence"
entails that you are somehow pulling from an untrusted source. I just
wanted to point out that "earlier" does not always mean "more trusted
than the thing you're pulling now" (since it might have just been pulled

Agreed.
-Peff
-
Btw, this is obviously only true for the native git protocol itself.
If the attacker can fool you into generating the new file _yourself_, he
can cause your checked-out copy to not match the git object database any
more.

In other words, one "interesting" attack vector is to feed you the
colliding SHA1 not through a git-to-git transfer, but by generating a
_patch_ that when applied will generate the collision, so that when you
then commit that patch, you get something else than you expected.

And _this_ is where it's important that the hash that git uses be a
non-trivial one - ie we don't want people to be able to generate two files
that look superficially "ok".

So here's the rule: If you ever get a patch that looks like line-noise,
especially from somebody you don't trust, DON'T APPLY IT!Now, that is obviously something you should never do _regardless_ of any
git issues, so I don't think this is really a problem either. If you apply
patches from people you don't have a good reason to trust without
sanity-checking them, you deserve whatever you get, and quite frankly, a
SHA1 hash collision is the _least_ of your problems ;)

(This ends up boiling down to one common issue: it's generally _much_
easier to attack a project through _other_ means than through a hash
collision. And I pretty much guarantee that that is the case even if we
were to use a much weaker hash, like MD5. Hash collisions fundamentally
just aren't good attack vectors, and it's a hell of a lot easier to try
to insert bad code by other means)

Linus
-
Sure. I was rather thinking of rapidly increasing number of git
repositories, each with growing history.
--
Krzysztof Halasa
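
For a rough sense of how object counts relate to _accidental_ collisions, here is a sketch of the standard birthday approximation, p ~ 1 - exp(-n(n-1)/2^(b+1)) for n objects and a b-bit hash. It assumes the hash behaves like a uniformly random function, which is exactly the property a cryptanalytic break undermines, so it says nothing about deliberate attacks. `collision_probability` is a hypothetical helper and the object counts are made-up inputs.

```python
import math


def collision_probability(n_objects, bits):
    """Birthday approximation for any two of n_objects sharing a bits-wide hash."""
    exponent = n_objects * (n_objects - 1) / 2 ** (bits + 1)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)


for n in (10 ** 6, 10 ** 9, 10 ** 12):  # illustrative object counts only
    print(n, collision_probability(n, 160), collision_probability(n, 256))
```

Under that assumption the probability only becomes appreciable near 2^80 objects for a 160-bit hash, far beyond any plausible number of git objects.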
-
