Hi list! As far as I have gathered, the SHA-1 sum is used as an identifier for commits, and that is the primary reason for using SHA-1. However, several sources (including the Google tech talk featuring Linus himself) state that the IDs are cryptographically secure. As discussed in [1], SHA-1 is not as secure as it once was (and this was in 2005), and I'm wondering: are there any plans for migrating to another hash algorithm, e.g. SHA-2 or Whirlpool? [1] http://www.schneier.com/blog/archives/2005/02/cryptanalysis_o.html -- mvh Henrik Austad
No. The cryptographic security we care about is that it's impractical to come up with another set of content that hashes to the same value as a given set of content. The known attacks on SHA-1 (and more broken earlier hashes in the same general class) only allow the attacker to produce two files that will collide. Now, it's true that this would allow somebody to produce a commit where some people see the "good" blob and some people see the "evil" blob, but (a) the "good" blob contains some large chunk of random data, which is a major red flag by itself, and (b) all of these people have to be taking data from the attacker. If somebody gives you some source, and it's got some large random chunk in it, and the behavior of the object depends on the content of this chunk, and it's unspecified where this chunk comes from, you should be aware that they might be able to swap this chunk for a different chunk. But such a file is pretty blatantly malicious anyway. -Daniel *This .sig left intentionally blank* --
yes, I can see that point, but I was thinking more along the lines of:
1) clone repo
2) add malicious code
3) add a huge block of comment, ifdef-block etc. somewhere obscure in the code and keep adding random data until the hash matches a well-known release.
4) publish repo, or even worse, change central repo
Most users, and probably a lot of developers, never browse through the *entire* archive looking for this, and as long as the hash checks out - why would you? Yes, it would probably be discovered soon enough, but take the linux kernel as an example - if you get, say 100 infected machines due to this, what would True, but this actually means you have to verify *everything*, even though the hash checks out. But yes, I can see your point, and it would most likely be infeasible to generate a collision using this approach, and changing to another hash function would probably not add much. Basically I was just curious and played ahead with the idea. Thanks for the answer though :) -- mvh Henrik Austad
All known methods for step 3, even on hashes considered long broken, will take until the heat death of the universe. The latest I can find is that, if you use MD4 (which is weak enough that you can find collisions as quickly as you can do two hashes), there's a 1 in a quadrillion chance that your message is weak and somebody could find a replacement with the same hash using known techniques. (With a plausible amount of work, an attacker could take a file and modify it only slightly, and find a replacement for that, but this again requires the attacker to have some non-trivial input to what gets put in the official tree, which leaves the attacker as the responsible party for that object). SHA-1 is enough stronger that the latest attacks are still unable to do with the current available computing power in years what can be done to MD4 in milliseconds. So it's highly unlikely that somebody will break If you don't verify *everything* when the hash checks out, the attacker will just send you a properly-constructed commit with a back door in the code. While you're looking for directly-inserted security holes in the code, you can probably notice if there's some big hunk of line noise in a comment that might make the file vulnerable to replacement. -Daniel *This .sig left intentionally blank* --
This depends greatly on git accepting objects with a colliding object-name, which it doesn't. Once you have an object with a particular SHA1, it will never get overwritten, ever, as git will believe it's about to do unnecessary work. As such, you'd still have to create a new object, hashing to a new SHA1 and get that new object added to the kernel. I think perhaps Andrew Morton and a few other "high brass" among the kernel hackers can get away with pushing crud like that to Linus' public tree (which is the de facto master copy of published kernel sources), but random John Doe's such as you and me wouldn't stand a chance, as our patches would get reviewed by someone who, at the end of the day, makes a living coding That depends. If the source of it was Linus' public tree, that would not be very good at all. If the source was a random tarball off a random webpage or ftp site (which would be the same as fetching and, unverified, using an Not really. What you need to verify is that a) You cloned from somewhere you trust (kernel.org, fe) b) The SHA1 of the commit you want to build from matches the SHA1 of the same commit in the repository you originally cloned from. Colliding objects can never enter a repository. Git is lazy and will reuse the already existing colliding object with the same name instead. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 --
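The "never overwritten" behaviour described above can be sketched in a few lines of Python. This is only an illustration, not git's actual implementation: the `ObjectStore` class and its `put` method are hypothetical names made up for this example. The one part taken from git itself is how a blob's name is computed, namely SHA-1 over a `blob <size>\0` header followed by the content.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Name a blob the way git does: SHA-1 over 'blob <size>\\0' + content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Hypothetical in-memory store that never replaces an existing object."""

    def __init__(self):
        self._objects = {}

    def put(self, oid: str, content: bytes) -> bool:
        """Store content under oid; return False if the name is already taken."""
        if oid in self._objects:
            # Lazy reuse: the store assumes it already has this object
            # and keeps the first copy it saw.
            return False
        self._objects[oid] = content
        return True

    def get(self, oid: str) -> bytes:
        return self._objects[oid]


store = ObjectStore()
good = b"hello\n"
oid = git_blob_id(good)
print(oid)                                      # same name `git hash-object` gives
print(store.put(oid, good))                     # first write is accepted
print(store.put(oid, b"evil replacement\n"))    # same name again: ignored
print(store.get(oid) == good)                   # the original content survives
```

The second `put` simulates an attacker offering different content under an existing name. The store keeps what it had, which is the property the argument above relies on.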
I think you are missing the point. One of the pluses behind originally using SHA-1 and the signed tags is that the system as a whole is cryptographically secure. You can verify from the public key of whoever made the tag that yes, this really is the source and history they tagged. Not only can DNS attacks be made, fooling users into thinking that they are really connecting to kernel.org, or whatever else server they expect to be connecting to, but also, the server itself may be hacked and objects replaced. I'm just not sure how much time it would take to find a collision. --
If the server is hacked and objects are replaced, they will either no longer match their cryptographic signature, meaning they'll be new objects or git will determine that they are corrupt, or they *will* match an existing object, but then that object won't be propagated to other repositories since git refuses to overwrite already existing objects. Either way, git's refusal to overwrite objects it already has plays a part in making malicious actions futile, since malicious code is only worth something if it's Even crypto-experts are arguing about that, so I'm not surprised. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 --
We were assuming here that once SHA-1 is broken really determined hackers will be able to come up with objects that -do- match the What about new users cloning the repo? They're just out of luck? I don't think this argument holds, if we want to 'advertise' that git is cryptographically secure we can do so only as long as our hashing algorithm is. (As such, should SHA-1 ever be fully broken we'd need to either switch to another algorithm or stop advertising being Of course this is true, it makes it a lot harder to do damage, but it doesn't eliminate the problem, it's just a free 'extra protection'. Yes, malicious code is only worth something if it's propagated and actually used, no, it is not impossible to do so in git if/when SHA-1 turns out to have collisions every other file. -- Cheers, Sverre Rabbelier --
Only until someone who's already cloned the repository fetches True. So far though, the only attacks that have been successful require that the attacker is allowed to create both of the colliding data-sets, and so far none has been found that would allow the attacker to follow any kind of syntactical rules whatsoever, so from a practical point of view, SHA1 is 100% secure *for source code*. From a theoretical point of view, no hash is 100% secure, so changing algorithm buys us nothing. Besides, "cryptographically secure" is not the same as "will never ever be broken", because all hashes are obviously susceptible to brute-force attacks. "Cryptographically secure" means, insofar as I've understood it, that given a source-file and a key, it would take such an extremely long time to find a different data-set that hashes to the same key that the result is unusable because the original source is obsolete. That is why legal documents are always signed with the "most secure" (or rather, "least insecure") of all available hashes. For our purposes, SHA1 suffices until someone comes up with a relatively Points of fact so far:
* It is possible to create objects with colliding names (SHA1 hash keys). This holds true whichever algorithm we use, although it will be more difficult with a stronger algorithm.
* It is impossible to distribute the colliding content to already cloned repositories. This also holds true for all hash algorithms.
I've been arguing that the value of the first point is so greatly diminished by the second, that even if SHA1 turns out to be horribly broken, projects using git will still have a decent protection against malicious code entering the repository without the knowledge of one of the authors. You've been arguing that SHA1 is not theoretically secure, which is obviously true since no hash is theoretically secure. I can think of one way to make git a lot more resilient to hash collisions, regardless of which hash is used, namely: Add the length of ...
Not really, because most attacks are about collisions, not second
preimages. They produce two 64-byte blocks (hence, same length) with
the same hash value.
As such, they allow changing a blob that *the attacker* injected into the
repository. The way the more "spectacular" attacks are devised requires
a "language" with conditional expressions -- for documents, for example,
Postscript is used. If you prepare a postscript file whose code is
if (AAAA == BBBB)
typeset document 1
else
typeset document 2
where AAAA and BBBB are collisions, and you change it to "if (BBBB ==
BBBB)", the hash will be the same, but the outcome will be document 1
instead of document 2.
The fact that this requires having the two "behaviors" in the blob is
not a big deal for source code: going into the wrong branch of an "if" can
be an attack. On the other hand, it makes adding the length useless
against collision attacks. True, it wouldn't be useless against second
preimage attacks, but SHA-1 is still secure with respect to those.
Paolo
--
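The AAAA/BBBB trick above can be tried out with a toy hash. The sketch below is not SHA-1: `toy_hash`, `compress`, the 8-byte block size and the 16-bit state are all assumptions chosen so that a collision can be found in a fraction of a second. What it shares with SHA-1 is the iterated block structure with a length block at the end, and that is enough to show why two colliding blocks of equal length keep colliding whatever identical data follows them, length included.

```python
import hashlib
from itertools import count

BLOCK = 8          # toy block size in bytes (SHA-1 uses 64)
IV = b"\x00\x00"   # toy 16-bit initial state


def compress(state: bytes, block: bytes) -> bytes:
    """Toy compression function with 16-bit output, so collisions are cheap."""
    return hashlib.sha256(state + block).digest()[:2]


def toy_hash(msg: bytes) -> bytes:
    """Iterate compress() over the zero-padded message plus a length block."""
    padded = msg + b"\x00" * (-len(msg) % BLOCK) + len(msg).to_bytes(BLOCK, "big")
    state = IV
    for i in range(0, len(padded), BLOCK):
        state = compress(state, padded[i:i + BLOCK])
    return state


def find_block_collision():
    """Birthday search for two distinct first blocks that give the same state."""
    seen = {}
    for n in count():
        block = n.to_bytes(BLOCK, "big")
        state = compress(IV, block)
        if state in seen:
            return seen[state], block
        seen[state] = block


def render(blob: bytes) -> str:
    """Stand-in for the PostScript conditional: compare the first two blocks."""
    if blob[:BLOCK] == blob[BLOCK:2 * BLOCK]:
        return "document 1"
    return "document 2"


aaaa, bbbb = find_block_collision()
tail = b" rest of the file, identical in both versions"
original = aaaa + bbbb + tail   # if (AAAA == BBBB) ...
swapped = bbbb + bbbb + tail    # if (BBBB == BBBB) ...

print(len(original) == len(swapped))            # same length
print(toy_hash(original) == toy_hash(swapped))  # same toy hash
print(render(original), "/", render(swapped))   # different behaviour
```

Both files have the same length and the same toy hash, yet they take different branches, so mixing the length into the hash does nothing against this kind of substitution.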
So what you're saying is that if someone owns a repository and adds a file to it, he can then replace his entire repository with an identical one where the good file is replaced with a bad one, and this will affect people who clone *after* the file gets replaced. Gee, that's one fiendishly large attack vector, quite apart from the fact that said author first has to come up with a program that gets widespread enough that a lot of people all of a sudden want to use it, but not so widespread that anyone would want to review it before using it. I remain unconvinced as to whether or not SHA1 is, for all practical purposes, cryptographically secure for git's uses. Sure, evil programmers can screw you over if you use their software without reviewing it, but that's hardly due to git using a particular cryptographic algorithm. Otoh, I'm not familiar enough with the nomenclature to say with 100% certainty what's cryptographically secure and what isn't. I just know that there are no collision-less hashes, so whatever "cryptographically secure" really means wrt hashes, "100% collision-free" isn't it. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 --
I agree (with the irony). Paolo --
No, if someone 0wnz a repository, not owns (Or really, malicious mirror owners could be in on it). Either that or some form of redirection attack. When you download a tarball, you can check the signed checksum that is downloadable along with it. When you clone a repo, you depend on signed tags. --
I am not really sure I follow this.... how can you 'verify from the
public key of whoever made the tag' that the SHA-1 hash is correct!?
SHA-1 does not have anything to do with any externally provided keys, or
have I managed to get something confused here?
Best regards,
Jurko Gospodnetić
--
Sorry for the confusion, it's about using the signed tag and the SHA-1 of the parent commits, along with their associated trees and blobs, to verify the source and history. If you can't trust the signed tag, or all of the SHA-1's, you can't trust the source and history. However, as many said, I don't think there is any reason to not trust SHA-1 in the context of source control.
This argument is invalid, since the use of git is not limited to source code. People can and do store unreadable binary data in git, and unless you are completely sure that no one would ever care about the security of that data in a way that can be attacked with a single collision, git should be secure about those as well. For example, I just converted a 20 GB repository to git which, among other things, contains pdf files of my tax returns. I have looked them over, but I have not opened them in a hex editor and looked them over at the binary level, and I don't think git should expect me to. Incidentally, git was the only version control system I tried except for subversion that didn't choke on that repository. Mercurial looked at my file renames and expanded the size past 45 GB before I killed it, I had to fix several bugs in the bazaar conversion scripts before I realized it was just too slow, and svk turns out to be even more like the Antichrist than subversion itself is (mirroring N repository copies requires an N-fold increase in size). Geoffrey --
If you haven't looked over your PDFs with a hex editor, you're depending on the security of the software generating the PDFs and on what you did in generating them. (Looking at the resulting image alone may be unwise if, for example, you redacted anything.) In any case, on the basis of your actions, you may this commit. Now, anyone receiving the repository can, due to the lack of second preimage attacks, be sure that (a) the document is as you committed it; or (b) the document is different from what you committed, but you made the substitution; or (c) the document is different from what you committed, and you were tricked into committing a document carefully designed by somebody else to be weak. Additionally, it's infeasible to create a document such that forensics after the fact can't turn up both the content as originally shown and the content as swapped from either document. I'm also not confident that PDFs are, in general, not vulnerable to an attack where they rasterize entirely differently depending on environmental factors (e.g., the document you're signing says something entirely different when printed on A4 paper than what it says printed on Letter); if so, it doesn't matter much that the document could be replaced, since an attacker could just control the environment and get the same effect. In any case, an attacker can't come along later and make a replacement of a file that originated in your commit. Also, you know that any sets of interchangeable documents had already been created when you get a commit that contains one of them. -Daniel *This .sig left intentionally blank* --
SHA-1 is broken in the sense that finding a collision requires less computation than brute force (2^80). It is still very costly, and AFAIK no one has found a single collision for SHA-1 yet, but even if such a collision is found, the question is how it can be exploited. This collision cannot be used to replace any existing code in Git. The only way to exploit this collision is to submit a patch based on one sequence to the maintainer, which has to look legitimate to be accepted, and then create another blob with malicious code based on the other sequence, so that the second blob has the same SHA-1; then anyone who pulls from you will get malicious code. However, it is tricky to create these two blobs -- one that should pass inspection and look like a real improvement, and another that should do what you want. All you have is two sequences of 20 bytes with the same SHA-1, and you have no control over them. For some binary files, it is possible by including both good and bad contents in the submitted blob and using one sequence in the right place to hide the bad part and make only the good one active/visible. Then the other blob will be almost the same but will contain the other sequence, which is used to activate the bad part. This can work if the maintainer cannot see everything but only the "visible" part. However, I don't think you can do anything like that with _source_ code, which is inspected. And if submitted code is not reviewed, there is nothing that can protect you from malicious code getting into the repository (and even worse, it will get directly into the official repository!). So, I don't think we have to worry much about the possibility of a collision attack, but only about preimage attacks; and a preimage attack on SHA-1 is far away from reality. Dmitry --
But they won't, because it's impossible to add two objects with the same SHA1 hash key to a git repository, since it will lazily re-use the existing one. In practice, this means that in the case of an "innocent" hash-collision, git will actually break by refusing to store the new Right. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 --
I'd also like to point out that Git usually receives "untrusted" new objects via the Git protocol through 'git index-pack'. If you look at sha1_object() in index-pack.c, you'll see that active verification against hash collision is performed, and the fetch will abruptly be aborted if ever that happens. Yes, writing a test case for this was tricky. :-) Nicolas --
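A rough Python sketch of that kind of check, for illustration only. It is not the code in index-pack.c: `receive`, `CollisionError` and `weak_id` are hypothetical names, and `weak_id` deliberately keeps just one byte of the SHA-1 so that a collision can actually be produced for the demo.

```python
import hashlib


class CollisionError(Exception):
    """Raised when incoming bytes differ from a stored object of the same name."""


def weak_id(content: bytes) -> str:
    """Deliberately weak object name (one byte of SHA-1), for demonstration only."""
    return hashlib.sha1(content).hexdigest()[:2]


def receive(store: dict, content: bytes, object_id=weak_id) -> str:
    """Accept an incoming object, aborting if it collides with a different one."""
    oid = object_id(content)
    existing = store.get(oid)
    if existing is not None and existing != content:
        # Same name, different bytes: refuse instead of silently reusing.
        raise CollisionError("collision on object " + oid)
    store[oid] = content
    return oid


def find_collision(object_id=weak_id):
    """Brute-force two different inputs with the same (weak) object name."""
    seen = {}
    n = 0
    while True:
        data = b"object %d" % n
        oid = object_id(data)
        if oid in seen:
            return seen[oid], data
        seen[oid] = data
        n += 1


first, second = find_collision()
store = {}
receive(store, first)
receive(store, first)          # receiving the identical object again is fine
try:
    receive(store, second)     # same name, different content
except CollisionError as err:
    print("fetch aborted:", err)
```

The byte-for-byte comparison only runs when a name is already present, so the check costs nothing for objects that are new to the repository.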
Here's the standard scenario for a hash collision attack, with parties A, B, and C:
1. C, the malicious one, computes the standard two pdfs with matching sha1 hashes.
2. C sends the valid pdf to B through a git commit, and B signs it with a tag.
3. C grabs the signature, and then forwards the "signed" commit to A, but substitutes the invalid pdf with the same hash.
The fact that git will check for hash collisions within one repository is nice, but it doesn't significantly increase the security of git against hash collision attacks. Geoffrey --
Sure. But this is all complete handwaving until a practical collision can be demonstrated. So far the demonstration hasn't happened, practical or not. Nicolas --
Sorry for the confusion: it would be handwaving if I was saying git was insecure, but I'm not. I'm saying that if or when SHA1 becomes vulnerable to collision attacks, git will be insecure. Geoffrey --
Right. And if or when that happens then we'll make Git secure again with a different hash. In the mean time there is low return for the effort involved. Nicolas --
Yes. I wasn't trying to advocate switching, just making sure people know that the "collisions don't matter" argument is bogus. One important thing: when SHA1 becomes vulnerable to collision attacks, it will still be secure to trust the repositories and tags that exist *at that moment.* I.e., the transition period from SHA1 to the next hash will also be secure, assuming that preimage attacks don't become possible simultaneously. So everything is good. Geoffrey --
It's bogus to say they completely don't matter, but I still claim that they don't matter for the things people actually care about. If people can generate collisions, they can commit a "weak" blob with a conditional that can be switched by replacing the blob. But it's almost always true that people could commit a blob with a conditional that can be switched by something else under the attacker's more direct control. Using a better hash function won't save you from a document like:
if (getdate() < 2009)
render_good_text
else
render_evil_text
even if it does help with:
if (AA == AA)
render_good_text
else
render_evil_text
If you're not checking your files for the former, you shouldn't worry about the latter, because the former is much easier and more subtle. (Now, an arbitrary preimage attack would actually be significant, still, because the attacker could replace an honestly-created "restrictive security policy" file with garbage that will be ignored, leaving stuff unprotected) -Daniel *This .sig left intentionally blank* --
I sincerely hope that pdf/postscript don't allow the internal rendering code to branch based on the current date. That would be an absurd security hole, and would indeed make you entirely correct. If you actually know that it is possible to write that in postscript, I would very much want to see an example. In any case, in a binary document format that isn't insane (examples of these at least include black and white .png images of documents), a visual check of the content is sufficient to ensure that the next person who looks at it will see roughly the same visual content. Git should be (and currently is) a secure method of transferring sane binary documents. Geoffrey --
Have a look at * http://th.informatik.uni-mannheim.de/People/Lucks/HashCollisions/letter_of_rec.ps vs * http://th.informatik.uni-mannheim.de/People/Lucks/HashCollisions/order.ps both found on a website[1] already mentioned[2] in this thread. :-) [1]: http://th.informatik.uni-mannheim.de/People/Lucks/HashCollisions/ [2]: http://marc.info/?l=git&m=120949349923584&w=2 - F -- Regards, Fredrik Skolmli --
This is an example of a hash collision, not conditional rendering based on the current date. I.e., you didn't actually read my email or the email I was replying to. :) Geoffrey --
Ah, you're right. Didn't notice the part about dates. Sorry ;-) -- Regards, Fredrik Skolmli --
PS is Turing complete, and does know about dates. So yes, you can make such conditionals. That original md5 paper with the 2 PDF files is mainly a good example that you shouldn't trust binary blobs, that's all. The md5 trick is a nice demo, but misses the point entirely. I can't find it now, but someone had written a PDF file that printed Pi, computing it inside the PS VM. The tiny file would keep the printer churning out paper until it ran out of memory. :-) cheers, m -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff --
I knew postscript was Turing complete, but had (naively) assumed it executed sandboxed and deterministically and would therefore display uniformly barring interpreter bugs. Looking over the spec, I can't find where it's possible to read the current date, but the usertime/realtime variables are sufficient as long as the attacker According to wikipedia, PDF doesn't have conditionals or loops of any kind, so you probably mean a postscript file. Geoffrey --
usertime and realtime are from the start of the invocation of the postscript interpreter, not based on the outside world. So, the interpreter could wait arbitrarily long, but has no way of knowing any external reference to time. I could imagine trickery with PDF signatures and their expiration times, but you shouldn't be able to do anything with the information, so it would be an exploit, and would probably be fixed. David --
You guys are right - I misremembered the spec wrt dates. I had the distinct impression that there was a way to get the epoch. Sorry about the noise. martin -- martin.langhoff@gmail.com martin@laptop.org -- School Server Architect - ask interesting questions - don't get distracted with shiny stuff - working code first - http://wiki.laptop.org/go/User:Martinlanghoff --
Just to add my 2 cents, examples of this are available on the web, like: http://th.informatik.uni-mannheim.de/People/Lucks/HashCollisions/ Same size, same hash. But that's with md5, not sha1. -- Matthieu --
Well yes, but that's still using the methods already mentioned in this thread. So you do have to get your "good" code approved before replacing it with something nasty. - Fredrik -- Regards, Fredrik Skolmli --
Why not wait until the results of: http://www.csrc.nist.gov/groups/ST/hash/index.html are available. That will surely be soon enough (I think 2012 is the expected finish date), and should prevent having to switch again in the future. The necessity or otherwise of improving the hashing will be clearer by then too. Tom --
