Re: Starting to think about sha-256?

Previous thread: Re: Problem with pack by Sergio Callegari on Sunday, August 27, 2006 - 10:45 am. (4 messages)

Next thread: Re: [PATCH 0/7] gitweb: Cleanups, fixes and small improvements by Jakub Narebski on Sunday, August 27, 2006 - 1:21 pm. (1 message)
From: Jeff Garzik
Date: Sunday, August 27, 2006 - 10:56 am

Recent press[1] is talking about sha-1 collisions again.  Even though 
the reported attack was against a weakened variant of sha-1 (64, not 80, 
passes), it serves as a useful point to start talking about the future.

I argue that sha-256 is better suited to git's purposes, and to modern 
machines, than sha-1.

Upsides to sha-256:
* not just a bit increase, but a stronger algorithm.  there is more 
mixing, doing a more-than-incrementally better job at avoiding collisions.
* the bit increase itself provides more hash space, theoretically 
reducing collisions.
* properly aligned, a set of 32-byte hashes won't straddle CPU cachelines.
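
A rough sketch of the two size-related points above, assuming an ideal
hash (the usual birthday approximation) and a 64-byte cacheline. Neither
assumption comes from the original message, and the function names are
made up for illustration.

```python
import math


def birthday_bound(bits: int, probability: float = 0.5) -> float:
    """Approximate number of random hashes needed before an accidental
    collision occurs with the given probability, for an ideal hash
    producing `bits` bits of output."""
    return math.sqrt(2.0 * (2.0 ** bits) * math.log(1.0 / (1.0 - probability)))


def straddles_cacheline(index: int, hash_bytes: int, line_bytes: int = 64) -> bool:
    """True if entry `index` of a packed array of hashes, whose first
    entry starts on a cacheline boundary, crosses into the next line.
    The 64-byte line size is an assumption, not a property of git."""
    start = index * hash_bytes
    end = start + hash_bytes - 1
    return start // line_bytes != end // line_bytes


if __name__ == "__main__":
    for name, bits in (("sha-1", 160), ("sha-256", 256)):
        exponent = math.log2(birthday_bound(bits))
        print(f"{name}: about 2^{exponent:.1f} hashes for a 50% accidental collision")
    for size in (20, 32):
        crossing = sum(straddles_cacheline(i, size) for i in range(1000))
        print(f"{size}-byte hashes: {crossing} of 1000 entries cross a cacheline")
```

This only models accidental collisions and memory layout; it says nothing
about deliberate attacks on either algorithm.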

Downsides to sha-256:
* git protocol/storage format change implications.
* increase in storage size (20 to 32 bytes per hash).
* fewer hand-optimized algorithm variants have been implemented.
* likely more CPU cycles per hash, though I haven't measured.
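
The storage and CPU points can be checked with a few lines of Python's
hashlib. This is only a sketch: the buffer size and round count are
arbitrary choices, and the timings depend on the machine and on how the
local hashlib was built, so no numbers are claimed here.

```python
import hashlib
import time


def digest_sizes() -> dict:
    """Bytes per hash for the two algorithms under discussion."""
    return {name: hashlib.new(name).digest_size for name in ("sha1", "sha256")}


def time_hash(name: str, data: bytes, rounds: int = 200) -> float:
    """Seconds spent hashing `data` `rounds` times with algorithm `name`."""
    start = time.perf_counter()
    for _ in range(rounds):
        hashlib.new(name, data).digest()
    return time.perf_counter() - start


if __name__ == "__main__":
    print(digest_sizes())
    payload = b"x" * (1 << 20)  # 1 MiB of dummy data, an arbitrary choice
    for name in ("sha1", "sha256"):
        print(f"{name}: {time_hash(name, payload):.3f} s for 200 rounds")
```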

The Wikipedia page has lots of info: 
http://en.wikipedia.org/wiki/Secure_Hash_Algorithm

Maybe sha-256 could be considered for the next major-rev of git?

	Jeff


[1] http://www.heise-security.co.uk/news/77244

-

From: Krzysztof Halasa
Date: Sunday, August 27, 2006 - 1:30 pm

Not sure, but _if_ we want it we should do it sooner rather than
later.
-- 
Krzysztof Halasa
-

From: Linus Torvalds
Date: Sunday, August 27, 2006 - 1:46 pm

Modifying git-convert-objects.c to rewrite the regular sha1 into a sha256 
should be fairly straightforward. It's never been used since the early 
days (and has limits, like a maximum of a million objects, that may need 
fixing), but it shouldn't be "fundamentally hard" per se.

		Linus
-

From: Krzysztof Halasa
Date: Sunday, August 27, 2006 - 2:14 pm

Sure. I was rather thinking of rapidly increasing number of git
repositories, each with growing history.
-- 
Krzysztof Halasa
-

From: Johannes Schindelin
Date: Sunday, August 27, 2006 - 3:02 pm

Hi,


But what about signed tags? (This issue has come up before, but has never 
been addressed.)

I also thought about supporting hybrid hashes, i.e. that older objects 
still can be hashed with SHA-1. Alas, a simple thought experiment 
demonstrates how silly that idea is: most of the objects will not change 
between two revisions, and they'd have to be rehashed with SHA-256 (or 
whatever we decide upon) anyway, so hybrids would do no good.

A better idea would be to increment the repository version, and expect 
SHA-1 for version 1, SHA-256 for version >= 2.
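
A minimal sketch of that idea, assuming the repository version is a plain
integer that every reader checks before naming objects. The function
names are hypothetical and this is not how any git release implements it.

```python
import hashlib


def hash_name_for_repo_version(version: int) -> str:
    """Pick the object hash from the repository version, following the
    proposal above: SHA-1 for version 1, SHA-256 for version >= 2."""
    if version < 1:
        raise ValueError(f"unknown repository version: {version}")
    return "sha1" if version == 1 else "sha256"


def object_name(version: int, payload: bytes) -> str:
    """Hex object name for `payload` in a repository of the given version."""
    return hashlib.new(hash_name_for_repo_version(version), payload).hexdigest()


if __name__ == "__main__":
    for version in (1, 2):
        name = object_name(version, b"example payload")
        print(version, hash_name_for_repo_version(version), len(name), name)
```

The point of the design is that a repository uses exactly one hash, so an
object never has two names within it.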

However, I could imagine that we do not need this huge change (it would 
break _many_ setups). The breakthrough was announced last Tuesday, and it 
involved 75% payload, i.e. to fake a new -- say -- git.c, one would need 
to enlarge git.c by a factor of 4, and you would see a lot of gibberish 
inside some comment. (Note that I did not listen to the talk myself, this 
is all deduced from the scarce information which is available via the 
'net.)

Even if the breakthrough really comes to full SHA-1, you still have to add 
_at least_ 20 bytes of gibberish. Which would be harder to spot, but it 
would be spotted.

This made me think about the use of hashes in git. Why do we need a hash 
here (in no particular order):

1) integrity checking,
2) fast lookup,
3) identifying objects (related to (2)),
4) trust.

Except for (4), I do not see why SHA-1 -- even if broken -- should not be 
adequate. It is not like somebody found out that all JPGs tend to have 
similar hashes so that collisions are more likely.

And thinking about trust: The hash is augmented by thinking persons. It is 
not like you blindly trust a person forever. You build up trust, and once 
somebody has failed you, the trust is lost, and very hard to build up again. So, 
you just would try to get all objects again from somebody you still trust, 
and never pull from the loser^H^H^H^H^Huntrusted person again. Ever.

Besides, as has been pointed out several times, a ...
-

From: Linus Torvalds
Date: Sunday, August 27, 2006 - 3:35 pm

Signed tags fundamentally have to be re-signed. That's by design: if 
somebody could rewrite an archive and signed tags would still be accepted 
to have the right signature, that would be a _serious_ sign of a totally 
broken security model.


Indeed. Hybrids would not only do no good, but they would actually 
_actively_ hurt things, because they'd fundamentally break the notion that 
the hash being identical means that the object (blob, tree, subtree) is 
the same.
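
A toy illustration of that breakage, assuming a hybrid repository in
which an old tree records a blob under its SHA-1 name and a newer tree
records the very same blob under a SHA-256 name. The helper names are
made up, and the raw content is hashed without any git header to keep the
sketch short.

```python
import hashlib


def name(content: bytes, algo: str) -> str:
    """Object name of `content` under the given hash algorithm."""
    return hashlib.new(algo, content).hexdigest()


def unchanged_by_name(old_name: str, new_name: str) -> bool:
    """The cheap comparison that relies on 'same name means same object'."""
    return old_name == new_name


if __name__ == "__main__":
    content = b"int main(void) { return 0; }\n"
    old_entry = name(content, "sha1")    # recorded by the old revision
    new_entry = name(content, "sha256")  # recorded by the new revision
    # Identical content, yet the name comparison reports a change.
    print(unchanged_by_name(old_entry, new_entry))
```

With a single hash per repository the comparison would report the blob as
unchanged without reading any content.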

So allowing two names for the same object is very fundamentally wrong in 

Yes. It would be reasonably painful for users, though (as Krzysztof 
correctly points out). Every client would have to convert when a 

Yeah, I don't think this is at all critical, especially since git really 
on a security level doesn't _depend_ on the hashes being cryptographically 
secure. As I explained early on (ie over a year ago, back when the whole 
design of git was being discussed), the _security_ of git actually depends 
on not cryptographic hashes, but simply on everybody being able to secure 
their own _private_ repository.

So the only thing git really _requires_ is a hash that is _unique_ for the 
developer (and there we are talking not of an _attacker_, but a benign 
participant).

That said, the cryptographic security of SHA-1 is obviously a real bonus. 
So I'd be disappointed if SHA-1 can be broken more easily (and I obviously 
already argued against using MD5, exactly because generating duplicates of 
that is fairly easy). But it's not "fundamentally required" in git per se.

[ The one exception: the "signed tags" security does depend on the hashes 
  being cryptographically strong. So again, breaking SHA-1 would not mean 
  that git stops working, but it _would_ potentially mean that if you 
  don't trust your own _private_ repository, the signed tag may no longer 

Correct. I'm pretty sure we had exactly this discussion around May 2005, 
but I'm too lazy to search ;)

		Linus
-

From: David Lang
Date: Monday, August 28, 2006 - 10:27 am

--On Sunday, August 27, 2006 03:35:20 PM -0700 Linus Torvalds 

just to double check.

if you already have a file A in git with hash X is there any condition 
where a remote file with hash X (but different contents) would overwrite 
the local version?

what would happen if you ended up with two packs that both contained a file 
with hash X but with different contents and then did a repack on them? 
(either packs from different sources, or packs downloaded through some 
mechanism other than the git protocol are two ways this could happen that I 
can think of)

David Lang
-

From: Linus Torvalds
Date: Monday, August 28, 2006 - 10:56 am

Nope. If it has the same SHA1, it means that when we receive the object 
from the other end, we will _not_ overwrite the object we already have.

So what happens is that if we ever see a collision, the "earlier" object 
in any particular repository will always end up overriding. But note that 
"earlier" is obviously per-repository, in the sense that the git object 
network generates a DAG that is not fully ordered, so while different 
repositories will agree about what is "earlier" in the case of direct 
ancestry, if the object came through separate and not directly related 
branches, two different repos may obviously have gotten the two objects in 
different order.

However, the "earlier will override" is very much what you want from a 
security standpoint: remember that the git model is that you should 
primarily trust only your _own_ repository. So if you do a "git pull", the 
new incoming objects are by definition less trustworthy than the objects 
you already have, and as such it would be wrong to allow a new object to 
replace an old one.
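
The "earlier will override" rule can be modelled with a toy object store
that never replaces an existing entry. This is a sketch of the behaviour
described above, not git's code; the class is hypothetical, and the
deliberately weak hash (the content length) exists only to force a
collision.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store where the first object stored under a
    name wins and later objects with the same name are ignored."""

    def __init__(self, hash_fn=None):
        self._objects = {}
        self._hash = hash_fn or (lambda data: hashlib.sha1(data).hexdigest())

    def add(self, data: bytes) -> str:
        name = self._hash(data)
        # setdefault keeps whatever is already stored under this name.
        self._objects.setdefault(name, data)
        return name

    def get(self, name: str) -> bytes:
        return self._objects[name]


if __name__ == "__main__":
    # Weak stand-in hash: any two inputs of equal length collide.
    store = ObjectStore(hash_fn=lambda data: str(len(data)))
    local = store.add(b"good")     # the object already in the repository
    incoming = store.add(b"evil")  # a colliding object arriving later
    print(local == incoming, store.get(local))
```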

So you have two cases of collision:

 - the inadvertent kind, where you somehow are very very unlucky, and two 
   files end up having the same SHA1. At that point, what happens is that 
   when you commit that file (or do a "git-update-index" to move it into 
   the index, but not committed yet), the SHA1 of the new contents will be 
   computed, but since it matches an old object, a new object won't be 
   created, and the commit-or-index ends up pointing to the _old_ object.

   You won't notice immediately (since the index will match the old object 
   SHA1, and that means that something like "git diff" will use the 
   checked-out copy), but if you ever do a tree-level diff (or you 
   do a clone or pull, or force a checkout) you'll suddenly notice that 
   that file has changed to something _completely_ different than what you 
   expected. So you would generally notice this kind of collision fairly 
   quickly.

   In ...
-

From: Linus Torvalds
Date: Monday, August 28, 2006 - 11:06 am

Btw, this is obviously only true for the native git protocol itself.

If the attacker can fool you into generating the new file _yourself_, he 
can cause your checked-out copy to not match the git object database any 
more.

In other words, one "interesting" attack vector is to feed you the 
colliding SHA1 not through a git-to-git transfer, but by generating a 
_patch_ that when applied will generate the collision, so that when you 
then commit that patch, you get something else than you expected.

And _this_ is where it's important that the hash that git uses be a 
non-trivial one - ie we don't want people to be able to generate two files 
that look superficially "ok".

So here's the rule: If you ever get a patch that looks like line-noise, 
especially from somebody you don't trust, DON'T APPLY IT!

Now, that is obviously something you should never do _regardless_ of any 
git issues, so I don't think this is really a problem either. If you apply 
patches from people you don't have a good reason to trust without 
sanity-checking them, you deserve whatever you get, and quite frankly, a 
SHA1 hash collision is the _least_ of your problems ;)

(This ends up boiling down to one common issue: it's generally _much_ 
easier to attack a project through _other_ means than through a hash 
collision. And I pretty much guarantee that that is the case even if we 
were to use a much weaker hash, like MD5. Hash collisions fundamentally 
just aren't good attack vectors, and it's a hell of a lot easier to try 
to insert bad code by other means)

			Linus
-

From: Jeff King
Date: Monday, August 28, 2006 - 11:32 am

This concept breaks down somewhat if you are pulling from two
repositories (one good and one evil). If I pull from the evil repo
first, that will become my "earlier" object, and I will never get the
colliding object from the good repo.

Executing such an attack might not be that hard, either (once we get
over that little hump of creating collisions at will!). The owner of
'evil' has to know a SHA1 that will be in 'good' before it makes it to
'good'. However, I imagine we frequently see SHA1s migrate from more
central repos (like .../torvalds/linux-2.6.git) to less central ones
(subsystem / port maintainers, etc).

-Peff
-

From: Linus Torvalds
Date: Monday, August 28, 2006 - 11:46 am

Sure. But if you are pulling from an untrusted source, you'd better at 
least check the result.

In fact, that's partly why "git pull" will do a diffstat after the pull. 
Exactly to force people to at least be minimally aware of what they 
pulled. And "gitk ORIG_HEAD.." is a great thing to always run when you 
pull from somebody you don't know and trust really well.

Of course, that all was done mostly not because I don't "trust" the people 
I work with, but more because I didn't always trust that they'd do the 
right thing with git (ie they'd screw up the repo not because they were 
evil, but because they made a mistake).

So even if you pull from an "evil" repo first, and you somehow get a "bad" 
object, the point is, the bad object _should_ be the one that overrides. 

Why? Because once you find out that the evil repo was bad (which you'll 
eventually find simply because it caused some bug - if the evil repo only 
helps you, it's obviously not evil at all), what you need to do is reset 
to _before_ the evil repo happened, do a "git repack -a -d" and finally a 
"git prune" to clean out all the bad cruft, and then pull the good repo 
without pulling the bad one first.

After that, you apologize to everybody for screwing up and pulling from 
somebody you didn't trust, and then ask them to re-clone (or give them the 
appropriate "git reset" + "git repack -a" + "git prune" + "git pull" 
sequence so that they can fix their existing repos).

The point being, a hash attack is really no worse than an attack that 
fools you into applying a really bad diff (regardless of SCM), and it's a 
hell of a lot harder to do. Both a hash attack and a diff attack mean that 
the person merging data should either trust his source or inspect the end 
result.

Anybody who just blindly accepts data from untrusted sources is screwed in 
so many other ways that the hash attack simply isn't even on the radar.

		Linus
-

From: Jeff King
Date: Monday, August 28, 2006 - 12:00 pm

I completely agree; however, even discussing "earlier takes precedence"
entails that you are somehow pulling from an untrusted source. I just
wanted to point out that "earlier" does not always mean "more trusted
than the thing you're pulling now" (since it might have just been pulled

Agreed.

-Peff
-

From: Krzysztof Halasa
Date: Monday, August 28, 2006 - 1:12 pm

Actually I think we may see it when somebody tries to put a real
example of conflicting SHA-1 pair into git repository.
-- 
Krzysztof Halasa
-

From: Linus Torvalds
Date: Monday, August 28, 2006 - 1:20 pm

Well, by definition, I wouldn't call that "inadvertent" ;)

Anyway, the way to do it (if you want to use git to document SHA1 hash 
mismatches) is to just check the files that have an identical SHA1 in. It 
will magically work!

Why? Because a git SHA1 is actually _not_ the SHA1 of the file itself, 
it's the SHA1 of the file _with_the_git_header_added_.

So if you find two files that have the same SHA1, they would also have to 
have the same length in order to actually generate the same object name. 
If they have different lengths, you can just check them into git, and 
they'll get two different git SHA1 names and you'll have a cool git 
archive that when you check the files out, the checked-out files will 
share the same SHA1 ;)
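
A short sketch of how a blob's object name is derived. The thread only
says "with the git header added"; the header layout used here (the word
"blob", a space, the decimal content length, and a NUL byte) is git's
blob header, and the function name is made up for illustration.

```python
import hashlib


def git_blob_name(content: bytes) -> str:
    """SHA-1 object name git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)  # the length is part of the header
    return hashlib.sha1(header + content).hexdigest()


def plain_sha1(content: bytes) -> str:
    """SHA-1 of the raw file content, with no git header."""
    return hashlib.sha1(content).hexdigest()


if __name__ == "__main__":
    data = b"hello world\n"
    print(plain_sha1(data))     # what sha1sum would report for the file
    print(git_blob_name(data))  # what git hash-object would report
```

Because the length goes into the hashed header, two files of different
lengths are hashed with different prefixes, which is why a plain SHA-1
collision between them would not carry over to their git object names.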

		Linus
-

From: Krzysztof Halasa
Date: Monday, August 28, 2006 - 2:12 pm

Well, conflicting files will most probably have the same size,
like with MD5 cases :-)
-- 
Krzysztof Halasa
-

From: Linus Torvalds
Date: Monday, August 28, 2006 - 2:23 pm

That's only true for the much easier injection case where you generate 
_both_ files together.

From an external git hash-attack standpoint, that's not a very useful 
case. It's much more useful if you can make a new file that has a hash 
that matches a given old file, and in that case, the filelengths are 
likely not the same.

			Linus
-

From: Johannes Schindelin
Date: Monday, August 28, 2006 - 4:09 pm

Hi,


The only notable exception I can think of: "git fetch -k". If you then try 
to retrieve the bogus object, it will return the one of whichever pack was 
returned first by readdir(). (If I read the source correctly.)

Now, the cases are rare where you do both "git fetch -k" and "git repack 
-a -d" (the latter of which _could_ leave a hole in the directory which 
_could_ make the next fetched pack fill that hole, which in turn _could_ 
make readdir() return that pack before more "senior" packs) in the same 
repository, but in these cases, yes, you could end up with the copy of the 
remote side.

You'd need to explicitly use "git fetch -k", though.

Ciao,
Dscho

-

From: Linus Torvalds
Date: Monday, August 28, 2006 - 4:48 pm

Good point.

I didn't even think of "-k", since I mentally put that in the "initial 
clone usage only" category, but yeah, if people use it for incremental 
updates too, that could indeed cause ambiguity in which object to use when 
the other end does something bad.

		Linus
-

From: Florian Weimer
Date: Monday, August 28, 2006 - 11:17 pm

According to a quick test using "openssl speed", it's a factor of two
to four, depending on the input size (the difference is less

And in 2008, you'd have to rewrite history again, to use the next
"stronger" hash function?  Do you think that's really necessary or
desirable?  Most users will have good control over what data enters
their repositories, so they can spot the evil twins thanks to their
high-entropy contents.  Obviously, a second preimage attack would
matter, but even for MD5, we aren't close to that one AFAIK.
-
