Re: About git and the use of SHA-1

Previous thread: git push: failed to push some refs and missing commit by Nico Schottelius on Monday, April 28, 2008 - 10:10 am. (2 messages)

Next thread: Re: git ML archive for download by Frederik Hohlfeld on Monday, April 28, 2008 - 7:11 am. (2 messages)
From: Henrik Austad
Date: Monday, April 28, 2008 - 9:29 am

Hi list!

As far as I have gathered, the SHA-1 sum is used as an identifier for commits,
and that is the primary reason for using SHA-1. However, several sources
(including the Google tech talk featuring Linus himself) state that the IDs
are cryptographically secure.
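
For readers who want to see what "the SHA-1 sum as identifier" means in practice: git names a blob by the SHA-1 of a short header (`blob <size>` plus a NUL byte) followed by the file content. A minimal sketch in Python; the function name `git_blob_id` is made up for illustration and is not part of git.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the object ID git assigns to a blob with this content."""
    # Git hashes the header "blob <size in bytes>\0" followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b"hello world\n"))
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The result should agree with `git hash-object` run on a file containing the same bytes.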

As discussed in [1], SHA-1 is not as secure as it once was (and this was in
2005), and I'm wondering: are there any plans for migrating to another
hash algorithm, e.g. SHA-2 or Whirlpool?

[1] http://www.schneier.com/blog/archives/2005/02/cryptanalysis_o.html
-- 
mvh Henrik Austad
From: Daniel Barkalow
Date: Monday, April 28, 2008 - 12:34 pm

No. The cryptographic security we care about is that it's impractical to 
come up with another set of content that hashes to the same value as a 
given set of content. The known attacks on SHA-1 (and more broken earlier 
hashes in the same general class) only allow the attacker to produce two 
files that will collide. Now, it's true that this would allow somebody to 
produce a commit where some people see the "good" blob and some people see 
the "evil" blob, but (a) the "good" blob contains some large chunk of 
random data, which is a major red flag by itself, and (b) all of these 
people have to be taking data from the attacker.

If somebody gives you some source, and it's got some large random chunk in 
it, and the behavior of the object depends on the content of this chunk, 
and it's unspecified where this chunk comes from, you should be aware 
that they might be able to swap this chunk for a different chunk. But such 
a file is pretty blatantly malicious anyway.

	-Daniel
*This .sig left intentionally blank*
--

From: Henrik Austad
Date: Monday, April 28, 2008 - 2:29 pm

yes, I can see that point, but I was thinking more along the line of:

1) clone repo
2) add malicious code
3) add a huge block of comment, ifdef-block etc. somewhere obscure in the code
and keep adding random data until the hash matches a well-known release.
4) publish repo, or even worse, change central repo

Most users, and probably a lot of developers, never browse through the *entire*
archive looking for this, and as long as the hash checks out - why would you?
Yes, it would probably be discovered soon enough, but take the Linux kernel
as an example - if you get, say, 100 infected machines due to this, what would

True, but this actually means you have to verify *everything*, even though the
hash checks out.

But yes, I can see your point, and it would most likely be infeasible to
generate a collision using this approach, and changing to another
hash function would probably not add much. Basically I was just curious and
played ahead with the idea.

Thanks for the answer though :)
-- 
mvh Henrik Austad
From: Daniel Barkalow
Date: Monday, April 28, 2008 - 3:15 pm

All known methods for step 3, even on hashes considered long broken, will 
take until the heat death of the universe. The latest I can find is that, 
if you use MD4 (which is weak enough that you can find collisions as 
quickly as you can do two hashes), there's a 1 in a quadrillion chance 
that your message is weak and somebody could find a replacement with the 
same hash using known techniques. (With a plausible amount of work, an 
attacker could take a file and modify it only slightly, and find a 
replacement for that, but this again requires the attacker to have some 
non-trivial input to what gets put in the official tree, which leaves 
the attacker as the responsible party for that object).

SHA-1 is enough stronger that the latest attacks are still unable to do 
with the current available computing power in years what can be done to 
MD4 in milliseconds. So it's highly unlikely that somebody will break 

If you don't verify *everything* when the hash checks out, the attacker 
will just send you a properly-constructed commit with a back door in the 
code. While you're looking for directly-inserted security holes in the 
code, you can probably notice if there's some big hunk of line noise in a 
comment that might make the file vulnerable to replacement.

	-Daniel
*This .sig left intentionally blank*
--

From: Andreas Ericsson
Date: Monday, April 28, 2008 - 11:38 pm

This depends greatly on git accepting objects with a colliding object-name,
which it doesn't. Once you have an object with a particular SHA1, it will
never get overwritten, ever, as git will believe it's about to do unnecessary
work. As such, you'd still have to create a new object, hashing to a new SHA1
and get that new object added to the kernel.

I think perhaps Andrew Morton and a few other "high brass" among the kernel
hackers can get away with pushing crud like that to Linus' public tree
(which is the de facto master copy of published kernel sources), but random
John Doe's such as you and me wouldn't stand a chance, as our patches would
get reviewed by someone who, at the end of the day, makes a living coding

That depends. If the source of it was Linus' public tree, that would not be
very good at all. If the source was a random tarball off a random webpage
or ftp site (which would be the same as fetching and, unverified, using an

Not really. What you need to verify is that
a) You cloned from somewhere you trust (kernel.org, for example)
b) The SHA1 of the commit you want to build from matches the SHA1 of the same
commit in the repository you originally cloned from.

Colliding objects can never enter a repository. Git is lazy and will reuse the
already existing colliding object with the same name instead.
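
The "first object wins" behaviour described here can be sketched with a toy content-addressed store. This is an illustration under assumptions, not git's implementation: the class `ObjectStore` and its `id_func` parameter are hypothetical, and the deliberately terrible ID function exists only to force a collision.

```python
import hashlib

class ObjectStore:
    """Toy content-addressed store; not git's actual code."""

    def __init__(self, id_func=lambda data: hashlib.sha1(data).hexdigest()):
        self._objects = {}
        self._id_func = id_func

    def put(self, data: bytes) -> str:
        oid = self._id_func(data)
        # First write wins: an object that already exists is never replaced.
        self._objects.setdefault(oid, data)
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]

# Use only the first byte as the "hash" so that a collision is trivial.
store = ObjectStore(id_func=lambda data: data[:1].hex())
good_id = store.put(b"good content")
evil_id = store.put(b"gremlin content")  # same first byte, hence same ID
assert good_id == evil_id
print(store.get(good_id))
# b'good content'
```

The colliding content is silently dropped, which is the laziness being described: the store believes it already has that object.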

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
--

From: Russ Dill
Date: Tuesday, April 29, 2008 - 12:09 am

I think you are missing the point. One of the pluses behind originally
using SHA-1 and the signed tags is that the system as a whole is
cryptographically secure. You can verify from the public key of
whoever made the tag that yes, this really is the source and history
they tagged. Not only can DNS attacks be made, fooling users into
thinking that they are really connecting to kernel.org, or whatever
else server they expect to be connecting to, but also, the server
itself may be hacked and objects replaced.

I'm just not sure how much time it would take to find a collision.
--

From: Andreas Ericsson
Date: Tuesday, April 29, 2008 - 12:21 am

If the server is hacked and objects are replaced, they will either
no longer match their cryptographic signature, meaning they'll be
new objects or git will determine that they are corrupt, or they
*will* match an existing object, but then that object won't be
propagated to other repositories since git refuses to overwrite
already existing objects. Either way, git's refusal to overwrite
objects it already has plays a part in making malicious actions
futile, since malicious code is only worth something if it's

Even crypto-experts are arguing about that, so I'm not surprised.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
--

From: Sverre Rabbelier
Date: Tuesday, April 29, 2008 - 4:05 am

We were assuming here that once SHA-1 is broken really determined
hackers will be able to come up with objects that -do- match the

What about new users cloning the repo? They're just out of luck? I
don't think this argument holds, if we want to 'advertise' that git is
cryptographically secure we can do so only as long as our hashing
algorithm is. (As such, should SHA-1 ever be fully broken we'd need to
either switch to another algorithm or stop advertising being

Of course this is true, it makes it a lot harder to do damage, but it
doesn't eliminate the problem, it's just a free 'extra protection'.
Yes, malicious code is only worth something if it's propagated and
actually used, no, it is not impossible to do so in git if/when SHA-1
turns out to have collisions every other file.

-- 
Cheers,

Sverre Rabbelier
--

From: Andreas Ericsson
Date: Tuesday, April 29, 2008 - 5:27 am

Only until someone who's already cloned the repository fetches

True. So far though, the only attacks that have been successful require
that the attacker is allowed to create both the colliding data-sets,
and so far none has been found that would allow the attacker to follow
any kind of syntactical rules whatsoever, so from a practical point
of view, SHA1 is 100% secure *for source code*.

From a theoretical point of view, no hash is 100% secure, so changing
algorithm buys us nothing.

Besides, "cryptographically secure" is not the same as "will never ever
be broken", because all hashes are obviously susceptible to brute-force
attacks. "Cryptographically secure" means, insofar as I've understood it
that given a source-file and a key, it would take such an extremely
long time to find a different data-set that hashes to the same key that
the result is unusable because the original source is obsolete.

That is why legal documents are always signed with the "most secure"
(or rather, "least insecure") of all available hashes. For our
purposes, SHA1 suffices until someone comes up with a relatively

Points of fact so far:
* It is possible to create objects with colliding names (SHA1 hash keys).
  This holds true whichever algorithm we use, although it will be more
  difficult with a stronger algorithm.
* It is impossible to distribute the colliding content to already cloned
  repositories. This also holds true for all hash algorithms.

I've been arguing that the value of the first point is so greatly
diminished by the second, that even if SHA1 turns out to be horribly
broken, projects using git will still have a decent protection against
malicious code entering the repository without the knowledge of one of
the authors.

You've been arguing that SHA1 is not theoretically secure, which is
obviously true since no hash is theoretically secure.

I can think of one way to make git a lot more resilient to hash
collisions, regardless of which hash is used, namely: Add the length
of ...
From: Paolo Bonzini
Date: Tuesday, April 29, 2008 - 6:05 am

Not really, because most attacks are about collisions, not second 
preimages.  They produce two 64-byte blocks (hence, same length) with 
the same hash value.

As such, they allow changing a blob that *the attacker* injected in the 
repository.  The way the more "spectacular" attacks are devised requires 
a "language" with conditional expressions -- for documents, for example, 
Postscript is used.  If you prepare a postscript file whose code is

    if (AAAA == BBBB)
      typeset document 1
    else
      typeset document 2

where AAAA and BBBB are collisions, and you change it to "if (BBBB == 
BBBB)", the hash will be the same, but the outcome will be document 1 
instead of document 2.

The fact that this requires having the two "behaviors" in the blob is 
not a big deal for source code, going in the wrong branch of an "if" can 
be an attack.  On the other hand, it makes adding the length useless for 
collision attacks.  True, it wouldn't be useless for second preimage 
attacks, but SHA-1 is still secure with respect to those.
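
To make the mechanics concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: `toy_hash` is a made-up iterated hash with a 16-bit state and 8-byte blocks (SHA-1 has a 160-bit state and 64-byte blocks), weak enough that a colliding pair of blocks can be found by brute force. It shows why two colliding blocks of the same length stay colliding when an identical suffix follows, and why including the length does not help.

```python
import hashlib
import itertools

BLOCK = 8          # toy block size in bytes (SHA-1 really uses 64)
IV = b"\x00\x00"   # toy 16-bit initial state (SHA-1 really has 160 bits)

def compress(state: bytes, block: bytes) -> bytes:
    # Toy compression function: keep only 16 bits so collisions are cheap.
    return hashlib.sha1(state + block).digest()[:2]

def absorb(state: bytes, data: bytes) -> bytes:
    # Feed block-aligned data through the compression function.
    for i in range(0, len(data), BLOCK):
        state = compress(state, data[i:i + BLOCK])
    return state

def toy_hash(data: bytes) -> str:
    # Zero-pad to a block boundary, then append the length as a final block.
    padded = data + b"\x00" * (-len(data) % BLOCK) + len(data).to_bytes(BLOCK, "big")
    return absorb(IV, padded).hex()

def find_colliding_blocks(prefix: bytes):
    # Birthday search: two different blocks giving the same state after prefix.
    assert len(prefix) % BLOCK == 0
    state = absorb(IV, prefix)
    seen = {}
    for n in itertools.count():
        block = n.to_bytes(BLOCK, "big")
        s = compress(state, block)
        if s in seen:
            return seen[s], block
        seen[s] = block

prefix = b"if (blk:"                         # exactly one block long
a, b = find_colliding_blocks(prefix)
suffix = b" == " + b + b") doc1 else doc2"   # identical in both files
file_one = prefix + a + suffix               # a != b: takes the "else" branch
file_two = prefix + b + suffix               # b == b: takes the "if" branch
assert file_one != file_two
assert len(file_one) == len(file_two)
assert toy_hash(file_one) == toy_hash(file_two)
```

Both files have the same length and the same toy hash, yet the embedded comparison evaluates differently, which is the shape of the attack described above.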

Paolo
--

From: Andreas Ericsson
Date: Tuesday, April 29, 2008 - 7:37 am

So what you're saying is that if someone owns a repository and adds a
file to it, he can then replace his entire repository with an identical
one where the good file is replaced with a bad one, and this will affect
people who clone *after* the file gets replaced.

Gee, that's one fiendishly large attack vector, quite apart from the
fact that said author first has to come up with a program that gets
widespread enough that a lot of people all of a sudden want to use
it, but not so widespread that anyone would want to review it before
using it.

I remain unconvinced as to whether or not SHA1 is, for all practical
purposes, cryptographically secure for git's uses. Sure, evil programmers
can screw you over if you use their software without reviewing it, but
that's hardly due to git using a particular cryptographic algorithm.

Otoh, I'm not familiar enough with the nomenclature to say with 100%
certainty what's cryptographically secure and what isn't. I just know
that there are no collision-less hashes, so whatever "cryptographically
secure" really means wrt hashes, "100% collision-free" isn't it.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
--

From: Paolo Bonzini
Date: Tuesday, April 29, 2008 - 7:52 am

I agree (with the irony).

Paolo
--

From: Russ Dill
Date: Tuesday, April 29, 2008 - 9:24 am

No, if someone 0wnz a repository, not owns (Or really, malicious
mirror owners could be in on it). Either that or some form of
redirection attack. When you download a tarball, you can check the
signed checksum that is downloadable along with it. When you clone a
repo, you depend on signed tags.
--

From: Jurko Gospodnetić
Date: Tuesday, April 29, 2008 - 5:46 am

I am not really sure I follow this.... how can you 'verify from the 
public key of whoever made the tag' that the SHA-1 hash is correct!? 
SHA-1 does not have anything do with any externally provided keys or 
have I managed to get something confused here?

   Best regards,
     Jurko Gospodnetić

--

From: Russ Dill
Date: Tuesday, April 29, 2008 - 9:21 am

On Tue, Apr 29, 2008 at 5:46 AM, Jurko Gospodnetić

Sorry for the confusion, it's about using the signed tag and the SHA-1
of the parent commits, along with their associated trees and blobs to
verify the source and history. If you can't trust the signed tag, or
all of the SHA-1's, you can't trust the source and history.
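
The chain of trust described here (tag, then commit, then tree, then blobs) can be sketched as nested hashes. This is a simplified illustration in Python: real git tree and commit objects carry more fields (file modes, author, committer, and so on), the signature on the tag is not modelled, and the helper names are hypothetical.

```python
import hashlib

def object_id(kind: str, body: bytes) -> str:
    # Git-style framing: "<kind> <size>\0" followed by the body.
    header = f"{kind} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def tree_id(files: dict) -> str:
    # Simplified tree: one "name blob-id" line per file, sorted by name.
    lines = [f"{name} {object_id('blob', data)}" for name, data in sorted(files.items())]
    return object_id("tree", "\n".join(lines).encode("utf-8"))

def commit_id(files: dict, parent: str, message: str) -> str:
    # Simplified commit: names its tree and its parent commit.
    body = f"tree {tree_id(files)}\nparent {parent}\n\n{message}\n"
    return object_id("commit", body.encode("utf-8"))

root = commit_id({"README": b"hello\n"}, parent="", message="initial")
honest = commit_id(
    {"README": b"hello\n", "main.c": b"int main(void) { return 0; }\n"},
    parent=root, message="add main")
tampered = commit_id(
    {"README": b"hello\n", "main.c": b"int main(void) { return 1; }\n"},
    parent=root, message="add main")
assert honest != tampered  # one changed byte in a blob changes the commit ID
```

A signature over the top-level ID therefore covers every blob and every ancestor, but only for as long as nobody can produce a different object with the same hash.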

However, as many said, I don't think there is any reason to not trust
SHA-1 in the context of source control.
From: Geoffrey Irving
Date: Tuesday, April 29, 2008 - 8:34 am

This argument is invalid, since the use of git is not limited to source code.
People can and do store unreadable binary data in git, and unless you are
completely sure that no one would ever care about the security of that data in
a way that can be attacked with a single collision, git should be secure about
those as well.

For example, I just converted a 20 GB repository to git which, among other
things, contains pdf files of my tax returns.  I have looked them over, but I
have not opened them in a hex editor and looked them over at the binary level,
and I don't think git should expect me to.

Incidentally, git was the only version control system I tried except for
subversion that didn't choke on that repository.  Mercurial looked at my file
renames and expanded the size past 45 GB before I killed it, I had to fix
several bugs in the bazaar conversion scripts before I realized it was just too
slow, and svk turns out to be even more like the Antichrist than subversion
itself is (mirroring N repository copies requires an N-fold increase in size).

Geoffrey
--

From: Daniel Barkalow
Date: Tuesday, April 29, 2008 - 9:27 am

If you haven't looked over your PDFs with a hex editor, you're depending 
on the security of the software generating the PDFs and on what you did in 
generating them. (Looking at the resulting image alone may be unwise if, 
for example, you redacted anything.) In any case, on the basis of your 
actions, you make this commit. Now, anyone receiving the repository can, 
due to the lack of second preimage attacks, be sure that (a) the document 
is as you committed it; or (b) the document is different from what you 
committed, but you made the substitution; or (c) the document is different 
from what you committed, and you were tricked into committing a document 
carefully designed by somebody else to be weak. Additionally, it's 
infeasible to create a document such that forensics after the fact can't 
turn up both the content as originally shown and the content as swapped 
from either document.

I'm also not confident that PDFs are, in general, not vulnerable to an 
attack where they rasterize entirely differently depending on 
environmental factors (e.g., the document you're signing says something 
entirely different when printed on A4 paper than what it says printed on 
Letter); if so, it doesn't matter much that the document could be 
replaced, since an attacker could just control the environment and get the 
same effect.

In any case, an attacker can't come along later and make a replacement of 
a file that originated in your commit. Also, you know that any sets of 
interchangeable documents had already been created when you get a commit 
that contains one of them.

	-Daniel
*This .sig left intentionally blank*
--

From: Dmitry Potapov
Date: Tuesday, April 29, 2008 - 5:41 am

SHA-1 is broken in the sense that finding a collision requires less
computation than brute force (2^80). It is still very costly, and AFAIK
no one has yet found a single collision for SHA-1. But even if such a
collision is found, the question is: how can it be exploited?
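
The 2^80 figure is the birthday bound for a 160-bit hash: a brute-force collision takes on the order of 2^(n/2) tries, while matching one given hash takes on the order of 2^n. The gap can be observed empirically on a truncated hash. A rough sketch in Python, assuming a truncated SHA-1 as a stand-in for a weak hash; the function names are made up for illustration.

```python
import hashlib
import itertools

def truncated_sha1(data: bytes, bits: int) -> int:
    # Keep only the top `bits` bits of the SHA-1 digest.
    digest = int.from_bytes(hashlib.sha1(data).digest(), "big")
    return digest >> (160 - bits)

def tries_until_collision(bits: int) -> int:
    # Hash "1", "2", "3", ... until any two inputs share a hash.
    seen = set()
    for n in itertools.count(1):
        h = truncated_sha1(str(n).encode(), bits)
        if h in seen:
            return n
        seen.add(h)

def tries_until_second_preimage(target: bytes, bits: int) -> int:
    # Hash "1", "2", "3", ... until one matches the hash of `target`.
    wanted = truncated_sha1(target, bits)
    for n in itertools.count(1):
        if truncated_sha1(str(n).encode(), bits) == wanted:
            return n

for bits in (8, 12, 16):
    print(bits,
          tries_until_collision(bits),
          tries_until_second_preimage(b"release tarball", bits))
```

The collision column should grow roughly like 2^(bits/2) and the second-preimage column roughly like 2^bits. Scaled up to 160 bits that is 2^80 versus 2^160, and the published attacks only lower the first figure.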

This collision cannot be used to replace any existing code in Git. The
only way to exploit it is to submit a patch based on one sequence to the
maintainer, which must look legitimate enough to be accepted, and then
create another blob with malicious code based on the other sequence, so
that the second blob has the same SHA-1. Then anyone who pulls from you
will get the malicious code.

However, it is tricky to create these two blobs: one that should pass
inspection and look like a real improvement, and another that does what
you want. All you have is two sequences of 20 bytes with the same SHA-1,
and you have no control over them. For some binary files, it is possible
by including both good and bad contents in the submitted blob and using
one sequence in the right place to hide the bad part and make only the
good one active/visible. Then the other blob will be almost the same but
contain the other sequence, which is used to activate the bad part. This
can work if the maintainer cannot see everything but only the "visible"
part. However, I don't think you can do anything like that with _source_
code, which is inspected. And if submitted code is not reviewed, there is
nothing that can protect you from malicious code getting into the
repository (and even worse, it will get directly into the official
repository!).

So, I don't think we have to worry much about the possibility of a
collision attack, but only about preimage attacks; and a preimage attack
on SHA-1 is far from reality.

Dmitry
--

From: Andreas Ericsson
Date: Tuesday, April 29, 2008 - 7:41 am

But they won't, because it's impossible to add two objects with the same
SHA1 hash key to a git repository, since it will lazily re-use the
existing one. In practice, this means that in the case of an "innocent"
hash-collision, git will actually break by refusing to store the new

Right.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
--

From: Nicolas Pitre
Date: Tuesday, April 29, 2008 - 8:42 am

I'd also like to point out that Git usually receives "untrusted" new 
objects via the Git protocol through 'git index-pack'.  If you look at 
sha1_object() in index-pack.c, you'll see that active verification 
against hash collision is performed, and the fetch will abruptly be 
aborted if ever that happens.
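
The idea of that check, reduced to its core, is: when an incoming object carries an ID the repository already has, compare the actual bytes and abort if they differ. A minimal sketch in Python, not the code in index-pack.c; `accept_incoming`, `CollisionError` and the placeholder object ID are made up for illustration.

```python
class CollisionError(Exception):
    """Raised when two different contents claim the same object ID."""

def accept_incoming(existing: dict, oid: str, data: bytes) -> None:
    # Add a received object, refusing one whose ID matches but whose bytes differ.
    if oid in existing:
        if existing[oid] != data:
            raise CollisionError(f"collision detected for object {oid}")
        return  # identical object already present; nothing to do
    existing[oid] = data

objects = {"abc123": b"trusted content"}
accept_incoming(objects, "abc123", b"trusted content")   # same bytes: no-op
try:
    accept_incoming(objects, "abc123", b"swapped content")
except CollisionError as err:
    print("fetch aborted:", err)
# fetch aborted: collision detected for object abc123
```

Note that this only protects a repository that already holds the original object; a fresh clone has nothing to compare against.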

Yes, writing a test case for this was tricky.  :-)


Nicolas
--

From: Geoffrey Irving
Date: Tuesday, April 29, 2008 - 8:59 am

Here's the standard scenario for a hash collision attack, with
parties, A, B, and C:

1. C, the malicious one, computes the standard two pdfs with matching
sha1 hashes.
2. C sends the valid pdf to B through a git commit, and B signs it with a tag.
3. C grabs the signature, and then forwards the "signed" commit to A,
but substitutes the invalid pdf with the same hash.

The fact that git will check for hash collisions within one repository
is nice, but it doesn't significantly increase the security of git
against hash collision attacks.

Geoffrey
--

From: Nicolas Pitre
Date: Tuesday, April 29, 2008 - 9:39 am

Sure.  But this is all complete handwaving until a practical collision 
can be demonstrated.  So far the demonstration hasn't happened, 
practical or not.


Nicolas
--

From: Geoffrey Irving
Date: Tuesday, April 29, 2008 - 10:48 am

Sorry for the confusion: it would be handwaving if I was saying git was insecure,
but I'm not.  I'm saying that if or when SHA1 becomes vulnerable to collision
attacks, git will be insecure.

Geoffrey
--

From: Nicolas Pitre
Date: Tuesday, April 29, 2008 - 10:55 am

Right.  And if or when that happens then we'll make Git secure again 
with a different hash.  In the mean time there is low return for the 
effort involved.


Nicolas
--

From: Geoffrey Irving
Date: Tuesday, April 29, 2008 - 11:02 am

Yes.  I wasn't trying to advocate switching, just making sure people
know that the "collisions don't matter" argument is bogus.

One important thing: when SHA1 becomes vulnerable to collision
attacks, it will still be secure to trust the repositories and tags
that exist *at that moment.*  I.e., the transition period from SHA1 to
the next hash will also be secure, assuming that preimage attacks
don't become possible simultaneously.  So everything is good.

Geoffrey
--

From: Daniel Barkalow
Date: Tuesday, April 29, 2008 - 11:41 am

It's bogus to say they completely don't matter, but I still claim that 
they don't matter for the things people actually care about. If people can 
generate collisions, they can commit a "weak" blob with a conditional that 
can be switched by replacing the blob. But it's almost always true that 
people could commit a blob with a conditional that can be switched by 
something else under the attacker's more direct control. Using a better 
hash function won't save you from a document like:

if (getdate() < 2009)
  render_good_text
else
  render_evil_text

even if it does help with:

if (AA == AA)
  render_good_text
else
  render_evil_text

If you're not checking your files for the former, you shouldn't worry 
about the latter, because the former is much easier and more subtle.

(Now, an arbitrary preimage attack would actually be significant, still, 
because the attacker could replace an honestly-created "restrictive 
security policy" file with garbage that will be ignored, leaving stuff 
unprotected)

	-Daniel
*This .sig left intentionally blank*
--

From: Geoffrey Irving
Date: Tuesday, April 29, 2008 - 1:31 pm

I sincerely hope that pdf/postscript don't allow the internal
rendering code to branch based on the current date.  That would be an
absurd security hole, and would indeed make you entirely correct.  If
you actually know that it is possible to write that in postscript, I
would very much want to see an example.

In any case, in a binary document format that isn't insane (examples
of these at least include black and white .png images of documents), a
visual check of the content is sufficient to ensure that the next
person who looks at it will see roughly the same visual content.  Git
should be (and currently is) a secure method of transferring sane
binary documents.

Geoffrey
--

From: Geoffrey Irving
Date: Tuesday, April 29, 2008 - 2:39 pm

This is an example of a hash collision, not conditional rendering
based on the current date.  I.e., you didn't actually read my email or
the email I was replying to. :)

Geoffrey
--

From: Fredrik Skolmli
Date: Tuesday, April 29, 2008 - 2:52 pm

Ah, you're right. Didn't notice the part about dates. Sorry ;-)

-- 
Regards,
Fredrik Skolmli
--

From: Martin Langhoff
Date: Tuesday, April 29, 2008 - 7:58 pm

PS is Turing complete, and does know about dates. So yes, you can make
such conditionals.

That original md5 paper with the 2 PDF files is mainly a good example
that you shouldn't trust binary blobs, that's all. The md5 trick is a
nice demo, but misses the point entirely.

I can't find it now, but someone had written a PDF file that printed
Pi, computing it inside the PS VM. The tiny file would keep the printer
churning out paper until it ran out of memory. :-)

cheers,


m
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
--

From: Geoffrey Irving
Date: Tuesday, April 29, 2008 - 10:18 pm

On Tue, Apr 29, 2008 at 7:58 PM, Martin Langhoff

I knew postscript was Turing complete, but had (naively) assumed it
executed sandboxed and deterministically and would therefore display
uniformly barring interpreter bugs.  Looking over the spec, I can't
find where it's possible to read the current date, but the
usertime/realtime variables are sufficient as long as the attacker

According to wikipedia, PDF doesn't have conditionals or loops of any
kind, so you probably mean a postscript file.

Geoffrey
--

From: David Brown
Date: Tuesday, April 29, 2008 - 10:47 pm

usertime and realtime are from the start of the invocation of the
postscript interpreter, not based on the outside world.  So, the
interpreter could wait arbitrarily long, but has no way of knowing any
external reference to time.

I could imagine trickery with PDF signatures and their expiration times,
but you shouldn't be able to do anything with the information, so it would
be an exploit, and would probably be fixed.

David
--

From: Martin Langhoff
Date: Tuesday, April 29, 2008 - 10:56 pm

You guys are right - I misremembered the spec wrt dates. I had the
distinct impression that there was a way to get the epoch.

Sorry about the noise.



martin
-- 
 martin.langhoff@gmail.com
 martin@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
--

From: Matthieu Moy
Date: Tuesday, April 29, 2008 - 11:17 am

Just to add my 2 cents, examples of this are available on the web,
like:

http://th.informatik.uni-mannheim.de/People/Lucks/HashCollisions/

Same size, same hash. But that's with md5, not sha1.

-- 
Matthieu
--

From: Fredrik Skolmli
Date: Tuesday, April 29, 2008 - 11:23 am

Well yes, but that's still using the methods already mentioned in this
thread. So you do have to get your "good" code approved before replacing it
with something nasty.

- Fredrik

-- 
Regards,
Fredrik Skolmli
--

From: Tom Widmer
Date: Tuesday, April 29, 2008 - 8:02 am

Why not wait until the results of:

are available. That will surely be soon enough.

Tom

--

From: Tom Widmer
Date: Tuesday, April 29, 2008 - 10:08 am

Why not wait until the results of:
http://www.csrc.nist.gov/groups/ST/hash/index.html
are available. That will surely be soon enough (I think 2012 is the
expected finish date), and should prevent having to switch again in the
future.

The necessity or otherwise of improving the hashing will be clearer by
then too.

Tom

--
