Hello there, I've searched the web for an answer but didn't find one, so I decided to take the risk of being yelled at and post it here. Some of my projects require having binary files (firmware and other stuff) somewhere in the tree structure. Unfortunately these files are big - 50MB and more. After a couple of new versions arrive (and get committed) I end up with a repository far bigger than necessary. The nature of these binary files is such that I care neither about their history nor about older versions. Hence the question: is there an easy way to tell git not to bother about the history of these particular files and keep the most recent version only? cheers, Petko -
If you don't care about versioning those files, why would you use a version control system? Just store them somewhere else, and use symlinks. Dave. -
That is certainly a way of doing it. However, it would be much simpler and faster to be able to "git clone" and then "git pull" every once in a while. The alternative involves copying the binaries alongside the repository with "cp -a" or, most likely, "scp -r", and you can never be sure that both are in sync. Petko -
Hi, I think that you're missing the point of version control. It's not only about having an up-to-date source tree, but also about being able to go back to a certain revision. What you want is most likely covered by "rsync -au". Hth, Dscho -
No contradiction here. In my case old source code will work perfectly with new binaries/firmware. That's why I don't _need_ the history, only the latest stuff, in order to save space. I do realize that what I am talking about is a statistically microscopic scenario, but it does exist. If there's no such feature then I don't have [...] Yeah, just like in the old days when "git pull" didn't do everything for you. thanks, Petko -
Hi, No, you _do_ miss the point here. You might _think_ that they work perfectly, but with revision control you want to have _exactly_ the same setup. You want to be able to go back to a certain _revision_ (including the then-current firmware). And that's what you don't want. So git is not for you. Ciao, Dscho -
I _know_ older code will work with new binaries. I know because I've done it many times and the application is the sort that is not going to forgive any frivolity. Unfortunately this is very specific to what I'm doing and does not apply [...] I have used git for SCV from day one. It's great. I was just thinking aloud about something I've stumbled upon. ;-) Petko -
You may not be interested in the history, but the entire purpose of any version control system (not just Git) is to record exactly that: history. If the exact contents of these large binaries *really* don't matter, as you say they don't, then why don't you just commit one and never touch it again? Cheers, Wincent -
Unfortunately those binaries do change, although the process is slow and not very frequent. And this is why it pokes me in the eye - for changing a few bytes I end up with a much larger repository. Petko -
For changing a few bytes you end up with a much larger repository? What happened to our packfiles? Is it possible for us to take a look at two versions of such a binary blob, one before the change and one after such a change that touches only a few bytes? -
Hi, Your subject is a little bit misleading, no? It's not about the binariness (git handles binary files just fine, thankyouverymuch), but about the not-tracking them. The answer is no. You cannot ask git to have the newest version of something, but not the old ones. It contradicts the distributedness of git, too. Hth, Dscho -
You're absobloodylutely correct. I was too preoccupied defining my problem [...] I don't agree here. Assume that whatever you're working on requires firmware for a device that won't change during the lifetime of the software project. The newest version of said firmware is mostly bugfixes and you basically don't want to revert to the older ones. Consider the microcode for modern Pentiums, Core 2, etc. What I am trying to suggest is that there might be cases when you need something in the repository, but you don't want GIT to keep its history or its predecessors. Leaving it out breaks the atomicity of such a repository and makes the project management more complex. There are a few examples out there that show how to solve this, but it seems inconvenient and involves branching, cloning, etc. Isn't it possible to add something like: "git nohistory firmware.bin" or "git nohistory -i-understand-this-might-be-dangerous firmware.bin" cheers, Petko -
You can always tag a blob (like the junio-gpg-pub tag in the git.git repository), but it wouldn't be in the working directory. It would get distributed on clone, though. BTW, if those large binary files don't differ much between versions, they should compress well even if you stored them normally, with all revisions. -- Jakub Narebski Poland ShadeHawk on #git -
Unfortunately this is not the case. These binary blobs are already compressed and/or encrypted, and adding even a few bytes ends up storing a new version at full size. cheers, Petko -
Use git-hash-object to put the file (-t blob) into the object database.
It returns the sha1 of the added object. Use git-tag to create a tag
pointing to the blob (use the returned sha1 for head). You can get the
file (on stdout) with "git cat-file blob tagname^{blob}".
The file would be in the object database, but not in the working directory.
Can't you store them uncompressed?
--
Jakub Narebski
Poland
-
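For reference, the sha1 that git-hash-object reports for a blob can be reproduced by hand. The Python sketch below assumes a repository using the default SHA-1 object format, where a blob is hashed as a "blob <size>\0" header followed by the content; the function name git_blob_sha1 is made up for illustration. The git commands in the comments follow the steps described above, with -w added because hash-object only writes into the object database when asked to; the file name and tag name are hypothetical.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the object name git assigns to a blob with this content.

    Assumes the default SHA-1 object format.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Roughly the same workflow with git itself (names are hypothetical):
#   sha1=$(git hash-object -w -t blob firmware.bin)   # store blob, print sha1
#   git tag firmware-latest "$sha1"                   # tag pointing at the blob
#   git cat-file blob "firmware-latest^{blob}" > firmware.bin
if __name__ == "__main__":
    # prints ce013625030ba8dba906f756967f9e9ca394464a
    print(git_blob_sha1(b"hello\n"))
```

Since the name depends only on the content, re-adding an unchanged firmware file yields the same object and costs no extra space; any changed byte yields a new object.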
Not really, but I can convert them into ascii format and store only the delta. This will admittedly increase the initial size of the repository, but hopefully not by much. Petko -
If you don't have the original uncompressed unencrypted file, what will converting them to ascii actually give you? Nicolas -
I hope that in the case of incremental changes (the first 5MB of the file stays the same, the last 64KB is actually new) the delta will be small and should compress well. This won't work for random changes along the length of the whole file. Petko -
But my question remains. If you cannot create good deltas out of your binary files, converting those binaries into ascii will do nothing to compression performance. Nicolas -
I did agree with you above. Binary or ascii, it doesn't matter. The experiments so far show that git is doing a good job finding deltas in binary files. This may be a solution for a certain part of my files... Petko -
Not much. I was blindly hoping that bzip2 may compress such files better, though I haven't tested this hypothesis yet. Will let you know if the result is different from the obvious one. :-) Petko -
Please read carefully the section about retagging / changing a tag in the git-tag(1) manpage; you should take care to propagate the change if you ever change the binary blob. Note that I haven't used the technique described. -- Jakub Narebski Poland -
If the encryption is halfway decent, a new IV/nonce will be chosen for each new revision, and you can't tell that two ciphertexts share a common prefix without fully decrypting them. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99 -
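The effect of a fresh nonce can be seen with a toy stream cipher. The Python sketch below is an assumption-laden illustration, not real cryptography and not what any firmware vendor uses: the keystream is simply SHA-256 over key, nonce and a counter, and all names (keystream, toy_encrypt, common_prefix_len) are invented here. With the same nonce, two plaintexts that share a prefix give ciphertexts that share it too; with a new nonce the shared prefix disappears.

```python
import hashlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce and a block counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def toy_encrypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; applying it twice decrypts."""
    stream = keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


key = b"k" * 32
old = b"A" * 4096            # stand-in for the old firmware image
new = old + b"B" * 64        # new image: same content plus 64 appended bytes

# Same nonce reused: the 4096-byte shared prefix survives encryption.
same_nonce = common_prefix_len(
    toy_encrypt(key, b"n1", old), toy_encrypt(key, b"n1", new))
# Fresh nonce for the new revision: the ciphertexts diverge almost at once.
fresh_nonce = common_prefix_len(
    toy_encrypt(key, b"n1", old), toy_encrypt(key, b"n2", new))
print(same_nonce, fresh_nonce)
```

With the second ciphertext unrelated to the first, a delta between the two revisions has nothing to copy from.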
But not versioning some files while versioning others breaks the atomicity of the project version, which is at the core of git's model. There is no such thing as "this file is at revision X, but that one is at [...] Not easily. It goes against the underlying data model at the core of git. How big are your firmware files? How often do they change, and how large are the changes? IOW, have you confirmed that repacking does not produce an acceptable delta, meaning you get versioning for very low space cost? -Peff -
Sigh. You are right. However, the said project is kind of an exception. The binaries are there from the very beginning - they are an indivisible part of the project and it won't work without them. This is why I am not worried if I revert to a previous source code version, but actually check out a fresh binary - in my [...] Changes don't happen too often, but the size of everything binary in the tree easily goes to about 100MB. Three commits later it ends up at about 300MB... cheers, Petko -
I did "git repack -f -a -d", but it didn't reduce the repository size. Those binaries are already compressed, so any change adds their full size once again. cheers, Petko -
OK, that was the answer I was looking for; it looks like you are out of luck. BTW, the main space-saver in repacking is _not_ compression, but rather finding deltas between similar objects (e.g., two versions of the same file that, although large, differ only by a small amount). So even compressed files can still produce space savings during a repack, though perhaps not as well because of randomness introduced by the compression. As an experiment, it might be worth trying to store the uncompressed versions instead (git will delta _and_ compress them for you). -Peff -
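To make the delta idea concrete, here is a toy illustration in Python. It uses difflib from the standard library, which is not the algorithm git uses for packfile deltas; the function name literal_bytes_needed is made up, and the sizes are scaled down from the "5MB plus 64KB" case mentioned earlier. It counts how many bytes of the new version cannot be expressed as "copy this range from the old version".

```python
import difflib
import random


def literal_bytes_needed(old: bytes, new: bytes) -> int:
    """Count bytes of `new` that cannot be copied from `old`."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return sum(
        j2 - j1
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
        if tag in ("insert", "replace")
    )


rng = random.Random(0)
# Incompressible-looking "old" image, and a "new" image that only appends.
old = bytes(rng.randrange(256) for _ in range(2000))
tail = bytes(rng.randrange(256) for _ in range(64))
new = old + tail

# prints 64: only the appended tail has to be stored literally
print(literal_bytes_needed(old, new))
```

The old content here is random, so compression alone would gain nothing, yet the second version still costs only the 64 new bytes once it is stored as a delta against the first.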
I don't have them uncompressed. I can try to convert those files into ascii format and then save them in the repository. Since most changes are incremental, git should be able to generate a relatively small delta, which should compress well enough. Thanks for the hint. Petko -
That is unlikely to help, since git can find deltas in binary files just as easily as in text files. All you are doing is changing the encoding. Rogan -
This won't help, only make things slightly worse. The problem is that when you compress the files, as soon as there is a difference between the two files, the output of the compressor will be entirely different for the remainder of the file. Unless your changes are always at the end of the file, you will end up with pretty much entire versions in the repo. I'm using some repos that get regular 50MB tarballs checked into them. I convinced them to not compress the tarballs, and git does a fairly decent job of doing delta compression between them (although it needs quite a bit of RAM to do so). Dave -
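The point about compressed input can be tried out with a small Python sketch. It uses zlib as a stand-in for whatever compressor produced the firmware images (an assumption), on synthetic data built from a made-up word list; the helper name common_prefix_len is also invented. One byte is changed near the start of the uncompressed data, and the script reports how long the two versions stay identical before and after compression.

```python
import random
import zlib


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


rng = random.Random(1)
words = [b"boot", b"flash", b"sector", b"checksum",
         b"loader", b"vector", b"table", b"init"]
# Compressible synthetic "image": 20000 words separated by spaces.
old = b" ".join(rng.choice(words) for _ in range(20000))

offset = 100
# Replace a single byte; '#' never occurs in the generated data.
new = old[:offset] + b"#" + old[offset + 1:]

old_z = zlib.compress(old)
new_z = zlib.compress(new)

print("uncompressed: identical for", common_prefix_len(old, new),
      "of", len(old), "bytes, and again after the changed byte")
print("compressed:   identical for", common_prefix_len(old_z, new_z),
      "of", len(old_z), "bytes")
```

On the uncompressed side everything except one byte is shared, which is what a delta can exploit. On the compressed side the streams typically part ways early and do not line up again, so each revision ends up stored more or less in full.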
