Well, one thing to keep in mind is that for source code in particular,
this is very seldom an issue.
So you can do a really *bad* job in theory, and in practice it really
works very very well.
Very few people keep binary blobs in any SCM archive _anyway_, partly
because they've always been told that it's unsafe (and with a lot of SCMs
it is), but even more because binary blobs are almost always generated by
some build method, so normally you'd never version them in the first
place, or versioning isn't all that helpful.
And most binary blobs are so *obviously* binary that even the stupidest
algorithm on earth will get it right. The only hard cases actually tend to
be really tiny files, or literally test-sequences.
Tiny files are hard because:
- they (by being tiny) have so few characters that they can easily lack
a "fingerprint" character (e.g. a NUL character or similar).
- tiny files are a lot more likely than bigger files to have strange
statistics that throw some more "sophisticated" rule off the scent.
Something like a "10% rule" tends to work fine if you have a big text,
and ten percent is still a reasonable number to average things out
over, but what if you only had ten characters to begin with?
The good news is that tiny files can usually be considered text, since
you'd seldom use a binary format for something really small anyway.
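
As a rough illustration of the heuristics discussed above (a NUL
"fingerprint", a "10% rule", and treating tiny files as text), here is a
minimal sketch. The function name looks_binary, the 32-byte cutoff for
"tiny", and the choice of which bytes count as suspicious are all
assumptions made for the example, not what any particular SCM does.

```python
def looks_binary(data: bytes, min_size: int = 32, threshold: float = 0.10) -> bool:
    """Guess whether data is binary. All cutoffs here are illustrative assumptions."""
    # Fingerprint check: a NUL byte almost never shows up in real text.
    if b"\x00" in data:
        return True
    # Tiny files: too few bytes for statistics to mean anything, so call them text.
    if len(data) < min_size:
        return False
    # "10% rule": count control bytes other than tab, LF and CR (plus DEL).
    # Bytes >= 0x80 are treated as text, since they may be UTF-8 or Latin-1.
    odd = sum(
        1
        for b in data
        if (b < 0x20 and b not in (0x09, 0x0A, 0x0D)) or b == 0x7F
    )
    return odd / len(data) > threshold
```

With this sketch, a three-byte file of control characters but no NUL is
still called text, which is exactly the "tiny files are hard, so just
assume text" fallback. A bigger file only gets flagged if it has a NUL
or if more than ten percent of it is control bytes.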
So I suspect that IN PRACTICE, especially if you come as a CVS replacement
(where binary files are just damn hard to get right even under the best of
circumstances!), you can do just about anything, including just saying
"everything is text", and you'd be fine.
It's entirely possible that that is exactly what CVSNT does ;)
Linus