Re: git and binary files

From: Petko Manolov
Date: Wednesday, January 16, 2008 - 3:34 am

Hello there,

I've searched the web for an answer, but i didn't find one, so i decided to
take the risk of being yelled at and post it here.

Some of my projects require having binary files (firmware and other stuff) 
somewhere in the tree structure.  Unfortunately these files are big - 50MB 
and more.  After a couple of new versions arrive (and get committed) i end 
up with a repository far bigger than necessary.

The nature of these binary files is such that i care neither about
their history nor about older versions.  Hence the question:  is there an easy
way to tell git not to bother about the history of these particular files 
and keep the most recent version only?


cheers,
Petko
-

From: David Symonds
Date: Wednesday, January 16, 2008 - 3:54 am

If you don't care about versioning those files, why would you use a
version control system? Just store them somewhere else, and use
symlinks.


Dave.
-

From: Petko Manolov
Date: Wednesday, January 16, 2008 - 6:21 am

That is certainly a way of doing it.  However, it would be much simpler and
faster to be able to "git clone" and then "git pull" every once in a while.
The alternative involves "cp -a" or most likely "scp -r" the binaries 
along with the repository and you can never be sure that both are in sync.


 		Petko
-

From: Johannes Schindelin
Date: Wednesday, January 16, 2008 - 6:42 am

Hi,


I think that you're missing the point of version control.  It's not only 
about having an up-to-date source tree, but also about being able to go 
back to a certain revision.

What you want is most likely covered by "rsync -au".

Hth,
Dscho

-

From: Petko Manolov
Date: Wednesday, January 16, 2008 - 6:58 am

No contradiction here.  In my case old source code will work perfectly 
with new binaries/firmware.  That's why i don't _need_ the history, only 
the latest stuff in order to save space.

I do realize that what i am talking about is a statistically microscopic
scenario, but it does exist.  If there's no such feature then i don't have 

Yeah, just like in the old days when "git pull" didn't do everything for 
you.


thanks,
Petko
-

From: Johannes Schindelin
Date: Wednesday, January 16, 2008 - 7:07 am

Hi,


No, you _do_ miss the point here.  You might _think_ that they work 
perfectly, but with revision control you want to have _exactly_ the same 
setup.  You want to be able to go back to a certain _revision_ (including 
the then-current firmware).

And that's what you don't want.  So git is not for you.

Ciao,
Dscho
-

From: Petko Manolov
Date: Wednesday, January 16, 2008 - 7:21 am

I _know_ older code will work with new binaries.  I know because i've done 
it many times and the application is the sort that is not going to forgive 
any frivolity.

Unfortunately this is very specific to what i'm doing and does not apply 

I have used git for SCM since day one.  It's great.  I was just thinking aloud
about something i've stumbled upon.  ;-)


 		Petko
-

From: Wincent Colaiuta
Date: Wednesday, January 16, 2008 - 7:34 am

You may not be interested in the history, but the entire purpose of any
version control system (not just Git) is to record exactly that:  
history.

If the exact contents of these large binaries *really* don't matter,
as you say they don't, then why don't you just commit one and never
touch it again?

Cheers,
Wincent


-

From: Petko Manolov
Date: Wednesday, January 16, 2008 - 7:45 am

Unfortunately those binaries do change, although the process is slow and
not very frequent.  And this is why it pokes me in the eye - for changing
a few bytes i end up with a much larger repository.


 		Petko
-

From: Junio C Hamano
Date: Wednesday, January 16, 2008 - 11:02 am

For changing a few bytes you end up with a much larger repository?
What happened to our packfiles?

Is it possible for us to take a look at two versions of such a
binary blob, one before the change and one after such a change
that touches only a few bytes?
-

From: Junio C Hamano
Date: Wednesday, January 16, 2008 - 11:09 am

I replied before reading the rest of the thread.  Please ignore
me on this one.
-

From: Johannes Schindelin
Date: Wednesday, January 16, 2008 - 4:54 am

Hi,


Your subject is a little bit misleading, no?  It's not about the 
binariness (git handles binary files just fine, thankyouverymuch), but 
about the not-tracking them.

The answer is no.  You cannot ask git to have the newest version of 
something, but not the old ones.  It contradicts the distributedness of 
git, too.

Hth,
Dscho

-

From: Petko Manolov
Date: Wednesday, January 16, 2008 - 6:39 am

You're absobloodylutely correct.  I was too preoccupied defining my problem 

I don't agree here.  Assume that whatever you're working on requires
firmware for a device that won't change during the lifetime of the 
software project.  The newest version of the said firmware is mostly 
bugfixes and you basically don't want to revert to the older ones. 
Consider the microcode for modern Pentiums, Core 2, etc.

What i am trying to suggest is that there might be cases when you need
something in the repository, but you don't want GIT to keep its history
nor its predecessors.  Leaving it out breaks the atomicity of such a
repository and makes the project management more complex.

There are a few examples out there that show how to solve this, but it
seems inconvenient and involves branching, cloning, etc.  Isn't it 
possible to add something like:

 	"git nohistory firmware.bin"

or
 	"git nohistory -i-understand-this-might-be-dangerous firmware.bin"



cheers,
Petko
-

From: Jakub Narebski
Date: Wednesday, January 16, 2008 - 6:53 am

You can always tag a blob (like junio-gpg-pub tag in git.git repository),
but it wouldn't be in a working directory. But it would get distributed
on clone.

BTW, if those large binary files don't differ much between versions,
they should compress well even if you store them normally,
all revisions.

-- 
Jakub Narebski
Poland
ShadeHawk on #git
-

From: Petko Manolov
Date: Wednesday, January 16, 2008 - 7:04 am

Unfortunately this is not the case.  These binary blobs are already 
compressed and/or encrypted and adding even a few bytes ends up storing
the new version at full size.


cheers,
Petko
-

From: Jakub Narebski
Date: Wednesday, January 16, 2008 - 7:20 am

Use "git hash-object -w" to put the file into the object database (the
default type, -t blob, is the one you want).  It returns the sha1 of the
added object.  Use git-tag to create a tag pointing at the blob (pass the
returned sha1 as the object to tag).  You can get the file (to stdout) with
"git cat-file blob tagname^{blob}".

The file would be in the object database, but not in the working directory.
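
For reference, the sha1 that git-hash-object reports for a blob is a hash
over a short header plus the file contents.  A minimal Python sketch of that
computation, assuming a repository that uses SHA-1 object ids; the name
git_blob_id and the sample firmware bytes are made up for illustration, and
unlike the git command this only computes the id and writes nothing to the
object database:

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 object id git would assign to a blob with this content."""
    # A blob is hashed as: b"blob <size in bytes>\0" followed by the raw content.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# Hypothetical stand-in for a firmware image; any bytes object works.
firmware = b"\x7fFW" + bytes(range(256)) * 4
print(git_blob_id(firmware))
```

Since the id depends only on the content, storing a changed binary always
yields a new object, which is why an existing tag would have to be moved to
point at it.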

Can't you store them uncompressed?

-- 
Jakub Narebski
Poland
-

From: Petko Manolov
Date: Wednesday, January 16, 2008 - 7:43 am

Not really, but i can convert them into ascii format and store only the 
delta.  This will admittedly increase the initial size of the repository, 
but hopefully not by much.


 		Petko
-

From: Nicolas Pitre
Date: Wednesday, January 16, 2008 - 8:01 am

If you don't have the original uncompressed unencrypted file, what will 
converting them to ascii actually give you?


Nicolas
-

From: Petko Manolov
Date: Wednesday, January 16, 2008 - 8:18 am

I hope that in the case of incremental changes (the first 5MB of the file is
the same, the last 64KB is actually new) the delta will be small and should
compress well.

This won't work for random changes along the length of the whole file.


 		Petko
-

From: Nicolas Pitre
Date: Wednesday, January 16, 2008 - 8:58 am

But my question remains.

If you cannot create good deltas out of your binary files, converting 
those binaries into ascii will do nothing to compression performance.


Nicolas
-

From: Petko Manolov
Date: Wednesday, January 16, 2008 - 9:06 am

I did agree with you above.  Binary or ascii it doesn't matter.

The experiments so far show that git is doing a good job finding deltas
in binary files.  This may be a solution for some of my files...


 		Petko
-

From: Petko Manolov
Date: Wednesday, January 16, 2008 - 9:09 am

Not much.  I was blindly hoping that bzip2 may compress such files better, 
though i haven't tested this hypothesis yet.  Will let you know if the 
result is different from the obvious one. :-)


 		Petko
-

From: Jakub Narebski
Date: Wednesday, January 16, 2008 - 9:34 am

Please read carefully the section about retagging / changing a tag
in the git-tag(1) manpage; you should take care to propagate the change
if you ever change the binary blob.

Note that I haven't used the technique described.

-- 
Jakub Narebski
Poland
-

From: Florian Weimer
Date: Wednesday, January 16, 2008 - 9:41 am

If the encryption is halfway decent, a new IV/nonce will be chosen for
each new revision, and you can't tell that two ciphertexts share a
common prefix without fully decrypting them.

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
-

From: Jeff King
Date: Wednesday, January 16, 2008 - 6:54 am

But not versioning some files while versioning others breaks the
atomicity of project version, which is at the core of git's model. There
is no such thing as "this file is at revision X, but that one is at

Not easily. It goes against the underlying data model at the core of
git.

How big are your firmware files? How often do they change, and how large
are the changes? IOW, have you confirmed that repacking does not produce
an acceptable delta, meaning you get versioning for very low space cost?

-Peff
-

From: Petko Manolov
Date: Wednesday, January 16, 2008 - 7:14 am

Sigh.  You are right.

However, the said project is kind of an exception.  The binaries are there
from the very beginning - they are an indivisible part of the project and it
won't work without them.  This is why i am not worried if i revert to 
previous source code version, but actually check-out fresh binary - in my 


Changes don't happen too often, but the size of everything binary in the 
tree easily goes to about 100MB.  Three commits later it ends up at about 
300MB...


cheers,
Petko
-

From: Jeff King
Date: Wednesday, January 16, 2008 - 7:18 am

Right, as loose objects. Did you try running "git-gc" to repack?

-Peff
-

From: Petko Manolov
Date: Wednesday, January 16, 2008 - 7:25 am

I did "git repack -f -a -d", but it didn't reduce the repository size. 
Those binaries are already compressed so any change adds their full size
once again.


cheers,
Petko
-

From: Jeff King
Date: Wednesday, January 16, 2008 - 7:32 am

OK, that was the answer I was looking for; it looks like you are out
of luck.

BTW, the main space-saver in repacking is _not_ compression, but rather
finding deltas between similar objects (e.g., two versions of the same
file that, although large, differ only by a small amount). So even
compressed files can still produce space savings during a repack, though
perhaps not as well because of randomness introduced by the compression.

As an experiment, it might be worth trying to store the uncompressed
versions instead (git will delta _and_ compress them for you).
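
To make the idea of a delta concrete, here is a deliberately simplified
Python sketch.  It is not git's delta algorithm (git's real format can copy
ranges from arbitrary offsets in the base object); it only keeps the shared
head and tail of two versions and stores the bytes in between.  The function
names and the sample data are made up for illustration:

```python
def common_prefix(a: bytes, b: bytes) -> int:
    """Length of the longest run of identical leading bytes."""
    limit = min(len(a), len(b))
    n = 0
    while n < limit and a[n] == b[n]:
        n += 1
    return n


def make_delta(old: bytes, new: bytes):
    """Describe `new` as: keep a head of `old`, insert bytes, keep a tail of `old`."""
    prefix = common_prefix(old, new)
    # The shared tail must not overlap the shared head in either version.
    max_suffix = min(len(old), len(new)) - prefix
    suffix = common_prefix(old[::-1][:max_suffix], new[::-1][:max_suffix])
    middle = new[prefix:len(new) - suffix]
    return prefix, middle, suffix


def apply_delta(old: bytes, delta) -> bytes:
    """Rebuild the new version from the old one plus the delta."""
    prefix, middle, suffix = delta
    return old[:prefix] + middle + old[len(old) - suffix:]


# Made-up 64 KiB "uncompressed firmware" and a version with a small patch.
old = bytes((i * 7) % 251 for i in range(65536))
new = old[:30000] + b"PATCH" + old[30004:]

delta = make_delta(old, new)
print(len(new), "bytes in the new version,", len(delta[1]), "bytes stored in the delta")
```

With uncompressed input the stored part stays about as small as the change
itself, which is the saving a repack is looking for.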

-Peff
-

From: Petko Manolov
Date: Wednesday, January 16, 2008 - 7:39 am

I don't have them uncompressed.

I can try to convert those files into ascii format and then save them in 
the repository.  Since most changes are incremental git should be able to 
generate relatively small delta, which should compress well enough.

Thanks for the hint.


 		Petko
-

From: Rogan Dawes
Date: Wednesday, January 16, 2008 - 8:05 am

That is unlikely to help, since git can find deltas in binary files just 
as easily as in text files. All you are doing is changing the encoding.
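
A small Python sketch of why re-encoding does not help, using hex as an
assumed stand-in for whatever "ascii format" might be chosen, and made-up
data where only the last 96 bytes differ.  The shared part and the changed
part both double in size, so the proportion a delta can reuse is unchanged:

```python
import binascii


def common_prefix(a: bytes, b: bytes) -> int:
    """Length of the longest run of identical leading bytes."""
    limit = min(len(a), len(b))
    n = 0
    while n < limit and a[n] == b[n]:
        n += 1
    return n


# Two made-up binary versions: same first 4000 bytes, different last 96 bytes.
old = bytes((i * 7) % 251 for i in range(4096))
new = old[:4000] + bytes(b ^ 0xFF for b in old[4000:])

old_hex = binascii.hexlify(old)
new_hex = binascii.hexlify(new)

raw_shared = common_prefix(old, new)
hex_shared = common_prefix(old_hex, new_hex)

print("binary:", raw_shared, "of", len(new), "bytes shared")
print("hex:   ", hex_shared, "of", len(new_hex), "bytes shared")
```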

Rogan
-

From: David Brown
Date: Thursday, January 17, 2008 - 11:52 pm

This won't help, only make things slightly worse.

The problem is when you compress the files, as soon as there is a
difference between the two files, the output of the compressor will be
entirely different for the remainder of the file.  Unless your changes
are always at the end of the file, you will end up with pretty much entire
versions in the repo.
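
This is easy to try out.  The Python sketch below builds two made-up text
files that differ in a single line in the middle, compresses both with zlib,
and reports how many leading and trailing bytes the two versions share
before and after compression.  Shared head plus tail is only a crude proxy
for what a delta algorithm could reuse, and the exact numbers depend on the
data and on the zlib build, so none are quoted here:

```python
import random
import zlib


def common_prefix(a: bytes, b: bytes) -> int:
    """Length of the longest run of identical leading bytes."""
    limit = min(len(a), len(b))
    n = 0
    while n < limit and a[n] == b[n]:
        n += 1
    return n


def common_suffix(a: bytes, b: bytes) -> int:
    """Length of the longest run of identical trailing bytes."""
    return common_prefix(a[::-1], b[::-1])


# Made-up data: 10000 fixed-width lines of 20 bytes each.
rng = random.Random(0)
lines = [b"reg %04d = %08x\n" % (i, rng.getrandbits(16)) for i in range(10000)]
old = b"".join(lines)

# Change a single line in the middle of the file.
patched = list(lines)
patched[5000] = b"reg 5000 = ffffffff\n"
new = b"".join(patched)

old_z = zlib.compress(old)
new_z = zlib.compress(new)

for label, a, b in (("raw", old, new), ("zlib", old_z, new_z)):
    shared = min(common_prefix(a, b) + common_suffix(a, b), len(a), len(b))
    print(f"{label:5s} size={len(b):7d} shared head+tail={shared:7d}")
```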

I'm using some repos that get regular 50MB tarballs checked into them.  I
convinced them to not compress the tarballs, and git does a fairly decent
job of doing delta compression between them (although it needs quite a bit
of RAM to do so).

Dave
-
