Hi,
We're seeing serious performance issues with repos that store media files, even relatively small
files. For example, a web site with less than 100 MB of images can take minutes to commit, push, or
pull when images have changed.
Our first guess was that git is repeatedly attempting to compress/decompress data that had already
been compressed. We tried these configuration settings (shooting in the dark) to no avail:
core.compression 0 ## Docs say this disables compression. Didn't seem to work.
pack.depth 1 ## Unclear what this does.
pack.window 0 ## No idea what this does.
gc.auto 0 ## We hope this disables automatic packing.
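For anyone reproducing this, a minimal sketch of how settings like the four above are applied per repository with `git config` and read back. The scratch directory is an assumption for illustration only; the comments paraphrase the explanations given later in this thread.

```shell
#!/bin/sh
# Sketch: apply the four settings to a throwaway repository and read them back.
set -e
repo=$(mktemp -d)          # hypothetical scratch location
cd "$repo"
git init -q

git config core.compression 0   # no zlib compression of loose objects or pack contents
git config pack.depth 1         # delta chains no deeper than 1
git config pack.window 0        # how many other objects are considered as delta bases
git config gc.auto 0            # no automatic repacking

# Print each value as git now sees it.
for key in core.compression pack.depth pack.window gc.auto; do
    printf '%s = %s\n' "$key" "$(git config --get "$key")"
done
```

These are written to the repository's own `.git/config`, so they do not affect other repositories on the machine.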
Our guess that re-compression is to blame may not even be valid since we can manually re-compress
these files in seconds, not minutes.
Is there a trick to getting git to simply "copy files as is"? In other words, don't attempt to
compress them, don't attempt to "diff" them, just store/copy/transfer the files as-is?
Thanks,
-John
--
Search the gitattributes manpage for the `delta` attribute, which should be unset for files that
you don't want git to attempt (binary) delta compression against.
P.S. There is also the git-bigfiles project that might be of interest to you.
--
Jakub Narebski
Poland
ShadeHawk on #git
--
That sounds way too slow from my experiences. I have a repository with 3
gigabytes of photos and videos. Committing 20M of new images takes a
second or two. The biggest slowdown is doing the sha1 over the new data
(which actually happens during "git add").
What version of git are you using? Have you tried "commit -q" to
suppress the diff at the end of commit?
Can you show us exactly what commands you're using, along with timings
so we can see where the slowness is?
For pushing and pulling, you're probably seeing delta compression, which
can be slow for large files (though again, minutes seems kind of slow to
me). It _can_ be worth doing for images, if you do things like change
only exif tags but not the image data itself. But if the images
themselves are changing, you probably want to try setting the "-delta"
attribute. Like:
echo '*.jpg -delta' >.gitattributes
Also, consider repacking your repository, which will generate a packfile
Git does spend a fair bit of time in zlib for some workloads, but it
That should disable zlib compression of loose objects and objects within
packfiles. It can save a little time for objects which won't compress,
but you will lose the size benefits for any text files.
But it won't turn off delta compression, which is what the
"compressing..." phase during push and pull is doing. And which is much
It says you can't make a chain of deltas deeper than 1. It's probably
It sets the number of other objects git will consider when doing delta
compression. Setting it low should improve your push/pull times. But you
will lose the substantial benefit of delta-compression of your non-image
files (and git's meta objects). So the "-delta" option above for
It disables automatic repacking when you have a lot of objects. You
_have_ to pack when pushing and pulling, since packfiles are the
on-the-wire format. What will help is:
1. Having repositories already packed, since git can re-use the packed
data.
2. Using -delta so that ...
Thanks so much. It's version 1.5.6.5. I compiled it 3 months ago.
For example, in one repo, there are 1200 source files, each on average 109K in size, for a total
size of 127M. The largest source file is 82M. Most of the non-text source files are already
compressed.
I packed the bare repo, then ran `gc --aggressive`. Then I did a `git pull`, which took 35 minutes.
The git processes in `top` seemed to peak at around 300M of memory.
Since then, I added 'binary -delta' to the .gitattributes for various files, based on suggestions
from this mailing list, but by that time did not wish to repeat the 35 minute pull to test it out.
Let's hope that made a difference.
You can simulate it all by generating a batch of 1-100 MB files from /dev/urandom (since they won't
compress), commit them, then do it again many times to simulate edits. Every few iterations, push it
somewhere.
I noticed some other folks on this list apparently having the same issues, but they don't know it
yet ("git hangs while compressing objects", etc.). That's probably the first symptom they'll see. It
*appears* to hang, but it's really spinning away on the `pack` gizmo.
I'm open to alternative suggestions -- some kind of dual-mode, where text files are "fully"
version'd, diff'd, delta'd, index'd, stash'd, pack'd, compress'd, object'd and whatever else git
needs to do, while non-text files are archived in a "lesser" manner. On the other hand, I get the
sense that the LAST thing git needs is another "mode"!
--
Hi John, Peff explained it very well. Some time ago, I had a similar problem: http://www.mentby.com/Group/git/how-to-prevent-git-from-compressing-certain-files.html and he helped me as well. Probably you may want to have a look at that thread. --
By git standards, that version is ancient. You may want to try with a
more recent version of git (at the very least, multithreaded delta
Note that "gc --aggressive" will repack from scratch, throwing away the
That sounds like a long time. What was taking so long? Was delta
compression pegging the CPU? Was it limited during the "Writing objects"
phase, which is going to be limited by either disk I/O or network speed?
How big is your packed repo? Given the pattern you describe below, I am
beginning to wonder if it is simply the case that even though a single
checkout of your repo isn't that large, the complete history of your
project may simply be gigantic (e.g., because you are repeatedly writing
new apparently-random versions of each file, so your repository size
will grow quite quickly).
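A rough way to check this is to compare the checkout size with the packed size reported by `git count-objects -v` (the `size-pack` field, in KiB). The sketch below uses a throwaway repository, a single hypothetical file name, and a placeholder identity, all of which are assumptions for illustration:

```shell
#!/bin/sh
# Sketch: two commits of an incompressible 1 MB file, then compare
# working-tree size against packed-history size.
set -e
repo=$(mktemp -d)                      # hypothetical scratch location
cd "$repo"
git init -q
git config user.name "Example"         # placeholder identity
git config user.email "example@example.com"

head -c 1048576 /dev/urandom >blob.rand
git add blob.rand && git commit -q -m one
head -c 1048576 /dev/urandom >blob.rand
git add blob.rand && git commit -q -m two

git gc -q
worktree_kb=$(du -sk blob.rand | cut -f1)
pack_kb=$(git count-objects -v | sed -n 's/^size-pack: //p')
echo "working tree: ${worktree_kb} KiB, packed history: ${pack_kb} KiB"
```

Because random data neither compresses nor deltas, the pack should come out at roughly the sum of every version ever committed, which is the growth pattern described above.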
Remember that a git clone transfers the full history (and a pull will
transfer all of the intermediate history). If you have rewritten those
files many times, you may be transferring many times your working
I tried this script to make a 100M working directory with a 400M .git
directory:
-- >8 --
#!/bin/sh
rm -rf big-repo
mkdir big-repo && cd big-repo && git init
mark() {
    echo "$(date) $*"
}
randomize() {
    mark randomize start
    for i in $(seq 1 100); do
        openssl rand $((1024*1024)) >"$i.rand"
    done
    mark randomize end
}
commit() {
    mark add start
    git add .
    mark add end
    mark commit start
    git commit -m "$1"
    mark commit end
}
randomize; commit base
randomize; commit one
randomize; commit two
randomize; commit three
-- 8< --
Here are a few timings I noted:
- it takes about 5 seconds to generate and write the random data
- git add runs in about 13 seconds. It pegs the CPU hashing all of the
data.
- the first commit is nearly instantaneous, as the summary diff takes
no work; subsequent commits spend about 9 seconds to create the
summary diff. Changing commit to "commit -q" drops that back to
...
Heya, Do we respect the .gitattributes and not try to generate the diffstat for files that are incompressible? -- Cheers, Sverre Rabbelier --
No, not to my knowledge. Even the "binary" attribute just says "this file is binary, don't text diff it". I think we will always still do rewrite-detection for operations like "git status" and the diff summary of "git commit". -Peff --
Heya, Would that not be a very sensible optimization that would help John (and other users of big files) a lot? -- Cheers, Sverre Rabbelier --
It might help some, but I worry about overloading the meaning of
"-delta". Right now it has a very clear meaning: don't delta for
packfiles. But that doesn't mean I might not want to see break detection
(or inexact rename detection, for that matter) at some time.
Large binary files shouldn't be taxing on regular diffs. If you have
marked a file as "binary" and we are not creating a binary diff (i.e.,
just printing "binary files differ"), then we shouldn't even need to
pull the blob from storage (since we can tell from the sha1 that it is
different). I haven't checked to see if we do that simple optimization
(if you haven't marked it with a binary attribute, then obviously we do
have to look at the blob to find out that it is binary).
So:
1. I think it would need a separate attribute that is about diffing
(possibly even just options to a custom diff filter).
2. I am not clear exactly what options would work best. Do you want to
disable diffing entirely? Disable just inexact rename detection and
break detection? If break detection is disabled, do you assume it
is _always_ a rewrite, or never?
So I am open to the idea, but I think we would need a more concrete
proposal and some timings to show how it is a benefit.
-Peff
--
Indeed. Please keep the delta attribute for what it is named after: deltas. And those are meant to be used in the context of object packing only. Nicolas --
I just compiled the latest git. It got worse!!
$ git --version
git version 1.5.6.5
$ time git gc --aggressive
Counting objects: 2086, done.
Compressing objects: 100% (2054/2054), done.
Writing objects: 100% (2086/2086), done.
Total 2086 (delta 676), reused 0 (delta 0)
real 4m28.573s
user 3m38.650s
sys 0m5.156s
$ git --version
git version 1.7.1
$ time git gc --aggressive
Counting objects: 2086, done.
Compressing objects: 100% (2054/2054), done.
Writing objects: 100% (2086/2086), done.
Total 2086 (delta 676), reused 0 (delta 0)
real 6m16.406s
user 5m28.665s
sys 0m6.196s
$ du -hs .git
The packed .git dir is 203 MB.
Yes, we make frequent changes to these files, and push/pull frequently as well. Just a normal
development pattern, though.
It's all manually done -- i.e.,
It's definitely the pull/push in git. Not knowing my way around git internals at all, I don't know
(nor do I really want to know, to be honest) which "sub-processes" of `git pull` or `git push` are
the culprit.
Yes, network bandwidth is always a factor, but I guess my expectation is that git shouldn't transfer
too much more info than the amount of recent changes. For example, if we change 10 files for a total
of 10MB, then my admittedly naive expectation is that git will send that 10MB of changes, plus some
small constant amount of meta info... not the whole repo every time. No?
--
Heya, I think that's because --aggressive got more aggressive :). We now do --window=200 and --depth=200 for --aggressive gc's. -- Cheers, Sverre Rabbelier --
I think Sverre is right that this is simply that --aggressive got more so in the last few versions.
But do note that aggressive implies that we should pack from scratch, not reusing previously found
deltas (or accepting that we didn't find deltas previously). So you might want "git gc --aggressive"
the _first_ time you pack, or possibly even very occasionally. But if you are packing every day, you
should just use "git gc", which will run much more quickly (and would probably have acceptable
behavior even without the -delta attribute, as it would only have to look at _new_ objects).
It will have to write the whole 200M packfile out each time, though. From your timings that looks to
take about 50 seconds or so (just looking at the difference between wall clock time and CPU time,
which is presumably spent in I/O).
Packing nightly won't hurt, but is perhaps excessive. It sounds like you
OK, that is not very big. Once packed, you really should not see
Your assumption is correct. Git should transmit at _worst_ 10MB in such a scenario (i.e., often much
less because of delta compression, but in your case of apparently-random media files, probably about
10MB).
I wasn't clear from your message: you indicated the changes you made, but are you still having
performance problems, or are you still waiting to get data?
-Peff
--
No. gc will only create a pack with new loose objects by default. Only if the number of packs
grows too large will it combine them into one
Packing nightly with a simple "git gc" i.e. without extra options should be perfectly fine.
Nicolas
--
I think that is only "gc --auto". With regular gc:
$ git init
$ echo content >file && git add file && git commit -m one
$ git gc
Counting objects: 3, done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)
$ du -a .git/objects/pack
4 .git/objects/pack/pack-5f6fe4b14529d73f51d7c8efa69306edd35f2302.idx
4 .git/objects/pack/pack-5f6fe4b14529d73f51d7c8efa69306edd35f2302.pack
12 .git/objects/pack
$ echo content >>file && git commit -a -m two
$ git gc
Counting objects: 6, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 0), reused 3 (delta 0)
$ du -a .git/objects/pack
4 .git/objects/pack/pack-ecf41a1c120eb911f50fdd2c159e94d5832974f7.idx
4 .git/objects/pack/pack-ecf41a1c120eb911f50fdd2c159e94d5832974f7.pack
12 .git/objects/pack
So six objects written in the second gc, and obviously a brand new single pack.
-Peff
--
Argh. You're right. And "gc --auto" is already run by many commands. It is "git repack" that doesn't combine packs by default. Nicolas --
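The difference between the two commands can be seen by counting pack files. This is only a sketch in a throwaway repository; the `count_packs` helper and the placeholder identity are assumptions for illustration, and `gc.auto` is set to 0 so that automatic gc does not interfere.

```shell
#!/bin/sh
# Sketch: "git repack -d" adds a pack per run, "git gc" collapses them.
set -e
repo=$(mktemp -d)                      # hypothetical scratch location
cd "$repo"
git init -q
git config user.name "Example"         # placeholder identity
git config user.email "example@example.com"
git config gc.auto 0                   # keep automatic gc out of the experiment

# Hypothetical helper: number of *.pack files currently in the repository.
count_packs() {
    ls .git/objects/pack/*.pack 2>/dev/null | wc -l | tr -d ' '
}

echo content >file && git add file && git commit -q -m one
git repack -d -q                       # packs only the loose objects
echo content >>file && git commit -q -a -m two
git repack -d -q                       # new objects go into a second pack
packs_after_repack=$(count_packs)

git gc -q                              # rewrites everything into a single pack
packs_after_gc=$(count_packs)

echo "after two 'git repack -d' runs: $packs_after_repack pack(s)"
echo "after 'git gc': $packs_after_gc pack(s)"
```

In a small repository like this one, the two plain repacks should leave two packs behind, and the following `git gc` should leave one.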
Just to follow up, the two solutions which have had a noticeable effect are,
first to run daily `gc`s, and, second, to configure a ".gitattributes" file as such:
*.jpg binary -delta
*.png binary -delta
*.psd binary -delta
*.gz binary -delta
*.bz2 binary -delta
.. and so on.
On my first go-round with ".gitattributes" (earlier in this thread), my patterns
were set up incorrectly, as in,
*.{gz,bz2,tgz,psd,png,jpg} binary -delta
Since git does not perform brace expansion, the above patterns never matched.
After revising the .gitattributes file, a ~6 minute gc dropped down to just
under ~3 minutes.
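A quick way to catch this kind of mistake is `git check-attr`, which reports the attribute git actually resolves for a path. The sketch below runs in a throwaway repository; `photo.jpg` is a hypothetical file name and does not need to exist.

```shell
#!/bin/sh
# Sketch: compare a brace pattern with one-pattern-per-line in .gitattributes.
set -e
repo=$(mktemp -d)                      # hypothetical scratch location
cd "$repo"
git init -q

# Brace pattern, as in the first attempt: the braces are taken literally.
echo '*.{gz,png,jpg} binary -delta' >.gitattributes
brace_result=$(git check-attr delta -- photo.jpg)

# One pattern per extension, as in the revised file.
printf '%s\n' '*.gz binary -delta' '*.png binary -delta' '*.jpg binary -delta' >.gitattributes
plain_result=$(git check-attr delta -- photo.jpg)

echo "$brace_result"    # delta should be reported as unspecified
echo "$plain_result"    # delta should be reported as unset
```

"unspecified" means no pattern matched, so git falls back to its default delta behaviour; "unset" is what `-delta` is supposed to produce.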
Is there any reason why someone would NOT want the above ".gitattributes"
defined by default?
--
Other than that our originally intended target audience are people who use git as a source code control system, not much. --
Ok, fair enough. It's your project, and you are defining "source control" as that which git supports: non-binary, line-by-line text only, C, bash .. no images, documents, etc. I only wish that definition of "source" had been more clear from the get-go. Perhaps a front and center blurb on the git home page or mission statement might clarify things for those of us who have different definitions of "source"? That way, you wouldn't have to be bothered by folks trying to version all their project assets with git. For example, you could specify that non-text is out of scope for git, (or however you wish to define "source"). --
I think both Junio's and my responses were not "we can't do a better job with non-text sources" but
rather "this is how we ended up with the current state". I'll admit I have some reservations about
trying to figure out a sane set of extensions for default .gitattributes, but that doesn't mean you
I don't know that we want to explicitly discourage such use. Obviously certain workflows don't work
as well with randomly-changing binary blobs (e.g., reading format-patch output is next to useless,
though it does still work as a transport if your project relies on emailed patch submissions).
In general, I think we are happy to take patches making binary storage more pleasant (e.g.,
textconv) as long as they don't somehow make the "normal" case of text worse. There are some things
for which git is simply not well suited (single files in the gigabytes, for example), and those
aren't likely to change because some of the issues are fundamental to how git works (though there
are often workarounds, like putting gigantic files in their own individual packs). But certainly
100M of jpgs does not seem like an unusable workload to me (as I mentioned, I have a several-gigabyte
photo repository that git does just fine with).
-Peff
--
and other than that many people use clean/smudge filters to make git happily and efficiently deltify compressed file formats (such as gz, bz2, zip) and still keep compressed checkouts... and other than that which you (plural) and I are not thinking of right now. Let the defaults be as they are (fit for source control in the proper sense), it's easy enough to change them for other use cases. Michael --
That's fine. We all have different ideas what revision control means. So long as it's clear what git considers "source" and what it considers out of scope, what the defaults are, and what the limitations are, potential users can more fairly evaluate git to see if it fits their needs. For example, code libraries and shell utilities may not require anything more complicated than line-by-line text-based patches in revision control. need not contain any newlines), may present a problem for git in this respect. Perhaps a section in the manual with a header such as "Handling non-text files", or "Revision control for media, XML, and other non line-oriented files" would clear this all up. You could almost cull the body of it from this thread and other similar threads. --
That is indeed a good idea. Do you volunteer? Nicolas --
Yes, of course. If y'all are not up to it, I'd be happy to give it a shot. --
I delta jpgs in one of my repositories. It is useful if the exif metadata changes but the image data does not. I assume you could do the same with other formats which have compressed and uncompressed portions (I also do it with video containers). I don't think it would ever make sense to try to delta gzip'd or bzip'd contents. I also don't use "binary", as I use a custom diff driver instead (binary implies "-diff"). As for what should be the default, until now the default has always been that no gitattributes are defined by default. This is nice because it's simple to understand; git doesn't care about filenames unless you tell it to. The downside obviously is that it may not perform optimally for some unusual workloads without extra configuration. We could probably do defaults for some common extensions, but I'm not really sure where such a thing should end up. For example, I consider *.psd a uselessly obscure extension, as Adobe doesn't write software for my platform of choice. Not that I mind having it in git, but rather that we are inevitably going to miss somebody's pet extension, and then we are right back where we started with them needing to configure, except now they also have to figure out which extensions have default attributes. -Peff --
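The distinction between "binary" and a custom diff driver can be checked with `git check-attr`. This is a sketch under assumptions: the driver name `exif` and the file names are hypothetical, and no driver is actually configured here; only the attribute resolution is shown.

```shell
#!/bin/sh
# Sketch: what "binary" implies for the diff attribute, versus naming a driver.
set -e
repo=$(mktemp -d)                      # hypothetical scratch location
cd "$repo"
git init -q

# *.psd uses the built-in "binary" macro; *.jpg names a (hypothetical) diff driver;
# *.gz only turns off delta compression.
printf '%s\n' '*.psd binary -delta' '*.jpg diff=exif' '*.gz -delta' >.gitattributes

psd_diff=$(git check-attr diff -- a.psd)
jpg_diff=$(git check-attr diff -- a.jpg)
gz_diff=$(git check-attr diff -- a.gz)
gz_delta=$(git check-attr delta -- a.gz)

echo "$psd_diff"
echo "$jpg_diff"
echo "$gz_diff"
echo "$gz_delta"
```

With "binary", diff is unset for the path, so a diff driver cannot also apply to it; `-delta` on its own leaves the diff attribute unspecified.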
I agree, no defaults are better than arbitrary defaults. So why is the default "text"? --
Because git was designed as a source control system? -Peff --
