Re: serious performance issues with images, audio files, and other "non-code" data

From: John
Date: Wednesday, May 12, 2010 - 11:53 am

Hi,

We're seeing serious performance issues with repos that store media files, even relatively small 
files. For example, a web site with less than 100 MB of images can take minutes to commit, push, or 
pull when images have changed.

Our first guess was that git is repeatedly attempting to compress/decompress data that had already 
been compressed. We tried these configuration settings (shooting in the dark) to no avail:

    core.compression 0   ## Docs say this disables compression. Didn't seem to work.
    pack.depth 1     ## Unclear what this does.
    pack.window 0    ## No idea what this does.
    gc.auto 0        ## We hope this disables automatic packing.
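
A minimal sketch of how the four settings above could be applied to a single
repository with "git config". The scratch directory is hypothetical, and the
comments summarize what each key is documented to control; this is not a
recommended configuration.

```shell
# Scratch repository (hypothetical location) so the settings stay local to it.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

git config core.compression 0   # zlib level 0: store objects uncompressed
git config pack.depth 1         # longest delta chain allowed in a pack
git config pack.window 0        # number of candidate objects tried per delta
git config gc.auto 0            # never trigger automatic gc

# Read the values back to confirm they were recorded in .git/config.
git config --get-regexp '^(core\.compression|pack\.(depth|window)|gc\.auto)$'
```

Note that these are repository-wide knobs, so they affect text files as well
as media files.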

Our guess that re-compression is to blame may not even be valid since we can manually re-compress 
these files in seconds, not minutes.

Is there a trick to getting git to simply "copy files as is"?  In other words, don't attempt to 
compress them, don't attempt to "diff" them, just store/copy/transfer the files as-is?

Thanks,
  -John
--

From: Jakub Narebski
Date: Wednesday, May 12, 2010 - 12:15 pm

Search the gitattributes manpage for the `delta` attribute; it should be
unset for files that you don't want git to attempt (binary) delta
compression against.

P.S. There is also the git-bigfiles project that might be of interest to
you.
-- 
Jakub Narebski
Poland
ShadeHawk on #git
--

From: Jeff King
Date: Thursday, May 13, 2010 - 10:10 pm

That sounds way too slow in my experience. I have a repository with 3
gigabytes of photos and videos. Committing 20M of new images takes a
second or two. The biggest slowdown is doing the sha1 over the new data
(which actually happens during "git add").

What version of git are you using? Have you tried "commit -q" to
suppress the diff at the end of commit?

Can you show us exactly what commands you're using, along with timings
so we can see where the slowness is?

For pushing and pulling, you're probably seeing delta compression, which
can be slow for large files (though again, minutes seems kind of slow to
me). It _can_ be worth doing for images, if you do things like change
only exif tags but not the image data itself. But if the images
themselves are changing, you probably want to try setting the "-delta"
attribute. Like:

  echo '*.jpg -delta' >.gitattributes

Also, consider repacking your repository, which will generate a packfile

Git does spend a fair bit of time in zlib for some workloads, but it

That should disable zlib compression of loose objects and objects within
packfiles. It can save a little time for objects which won't compress,
but you will lose the size benefits for any text files.

But it won't turn off delta compression, which is what the
"compressing..." phase during push and pull is doing. And which is much

It says you can't make a chain of deltas deeper than 1. It's probably

It sets the number of other objects git will consider when doing delta
compression. Setting it low should improve your push/pull times. But you
will lose the substantial benefit of delta-compression of your non-image
files (and git's meta objects). So the "-delta" option above for

It disables automatic repacking when you have a lot of objects. You
_have_ to pack when pushing and pulling, since packfiles are the
on-the-wire format. What will help is:

  1. Having repositories already packed, since git can re-use the packed
     data.

  2. Using -delta so that ...
--

From: John
Date: Friday, May 14, 2010 - 5:54 am

Thanks so much. It's version 1.5.6.5. I compiled it 3 months ago. For example, 
in one repo, there are 1200 source files, each on average 109K in size, for a 
total size of 127M. The largest source file is 82M. Most of the non-text source 
files are already compressed.

I packed the bare repo, then ran `gc --aggressive`. Then I did a `git pull`, 
which took 35 minutes. The git processes in `top` seemed to peak at around 300M 
of memory. Since then, I added 'binary -delta' to the .gitattributes for various 
files, based on suggestions from this mailing list, but by that time did not 
wish to repeat the 35 minute pull to test it out. Let's hope that made a difference.

You can simulate it all by generating a batch of 1-100 MB files from 
/dev/urandom (since they won't compress), commit them, then do it again many 
times to simulate edits. Every few iterations, push it somewhere.


I noticed some other folks on this list apparently having the same issues, but 
they don't know it yet ("git hangs while compressing objects", etc.). That's 
probably the first symptom they'll see. It *appears* to hang, but it's really 
spinning away on the `pack` gizmo.

I'm open to alternative suggestions -- some kind of dual-mode, where text files 
are "fully" version'd, diff'd, delta'd, index'd, stash'd, pack'd, compress'd, 
object'd and whatever else git needs to do, while non-text files are archived in 
a "lesser" manner.  On the other hand, I get the sense that the LAST thing git 
needs is another "mode"!





--

From: Dirk Süsserott
Date: Friday, May 14, 2010 - 10:26 am

Hi John,

Peff explained it very well. Some time ago, I had a similar problem:
http://www.mentby.com/Group/git/how-to-prevent-git-from-compressing-certain-files.html
and he helped me as well. Probably you may want to have a look at that 
thread.


--

From: Jeff King
Date: Monday, May 17, 2010 - 4:16 pm

By git standards, that version is ancient. You may want to try with a
more recent version of git (at the very least, multithreaded delta

Note that "gc --aggressive" will repack from scratch, throwing away the

That sounds like a long time. What was taking so long? Was delta
compression pegging the CPU? Was it limited during the "Writing objects"
phase, which is going to be limited by either disk I/O or network speed?

How big is your packed repo? Given the pattern you describe below, I am
beginning to wonder if it is simply the case that even though a single
checkout of your repo isn't that large, the complete history of your
project may simply be gigantic (e.g., because you are repeatedly writing
new apparently-random versions of each file, so your repository size
will grow quite quickly).

Remember that a git clone transfers the full history (and a pull will
transfer all of the intermediate history). If you have rewritten those
files many times, you may be transferring many times your working

I tried this script to make a 100M working directory with a 400M .git
directory:

-- >8 --
#!/bin/sh

rm -rf big-repo
mkdir big-repo && cd big-repo && git init

mark() {
  echo "`date` $*"
}

randomize() {
  mark randomize start
  for i in `seq 1 100`; do
    openssl rand $((1024*1024)) >$i.rand
  done
  mark randomize end
}

commit() {
  mark add start
  git add .
  mark add end
  mark commit start
  git commit -m "$1"
  mark commit end
}

randomize; commit base
randomize; commit one
randomize; commit two
randomize; commit three
-- 8< --

Here are a few timings I noted:

  - it takes about 5 seconds to generate and write the random data

  - git add runs in about 13 seconds. It pegs the CPU hashing all of the
    data.

  - the first commit is nearly instantaneous, as the summary diff takes
    no work; subsequent commits spend about 9 seconds to create the
    summary diff.  Changing commit to "commit -q" drops that back to
    ...
--

From: Sverre Rabbelier
Date: Monday, May 17, 2010 - 4:33 pm

Heya,


Do we respect the .gitattributes and not try to generate the diffstat
for files that are incompressible?

-- 
Cheers,

Sverre Rabbelier
--

From: Jeff King
Date: Tuesday, May 18, 2010 - 12:07 pm

No, not to my knowledge. Even the "binary" attribute just says "this
file is binary, don't text diff it". I think we will always still do
rewrite-detection for operations like "git status" and the diff summary
of "git commit".

-Peff
--

From: Sverre Rabbelier
Date: Tuesday, May 18, 2010 - 12:10 pm

Heya,


Would that not be a very sensible optimization that would help John
(and other users of big files) a lot?

-- 
Cheers,

Sverre Rabbelier
--

From: Jeff King
Date: Tuesday, May 18, 2010 - 12:27 pm

It might help some, but I worry about overloading the meaning of
"-delta". Right now it has a very clear meaning: don't delta for
packfiles. But that doesn't mean I might not want to see break detection
(or inexact rename detection, for that matter) at some time.

Large binary files shouldn't be taxing on regular diffs.  If you have
marked a file as "binary" and we are not creating a binary diff (i.e.,
just printing "binary files differ"), then we shouldn't even need to
pull the blob from storage (since we can tell from the sha1 that it is
different). I haven't checked to see if we do that simple optimization
(if you haven't marked it with a binary attribute, then obviously we do
have to look at the blob to find out that it is binary).

So:

  1. I think it would need a separate attribute that is about diffing
     (possibly even just options to a custom diff filter).

  2. I am not clear exactly what options would work best. Do you want to
     disable diffing entirely? Disable just inexact rename detection and
     break detection? If break detection is disabled, do you assume it
     is _always_ a rewrite, or never?

So I am open to the idea, but I think we would need a more concrete
proposal and some timings to show how it is a benefit.

-Peff
--

From: Nicolas Pitre
Date: Tuesday, May 18, 2010 - 12:37 pm

Indeed. Please keep the delta attribute for what it is named after: 
deltas. And those are meant to be used in the context of object packing 
only.


Nicolas
--

From: John
Date: Tuesday, May 18, 2010 - 11:50 am

I just compiled the latest git. It got worse!!

$  git --version
git version 1.5.6.5
$ time git gc --aggressive
Counting objects: 2086, done.
Compressing objects: 100% (2054/2054), done.
Writing objects: 100% (2086/2086), done.
Total 2086 (delta 676), reused 0 (delta 0)

real    4m28.573s
user    3m38.650s
sys     0m5.156s
$  git --version
git version 1.7.1
$ time git gc --aggressive
Counting objects: 2086, done.
Compressing objects: 100% (2054/2054), done.
Writing objects: 100% (2086/2086), done.
Total 2086 (delta 676), reused 0 (delta 0)

real    6m16.406s
user    5m28.665s
sys     0m6.196s
$ du -hs .git


The packed .git dir is 203 MB. Yes, we make frequent changes to these files, and push/pull 
frequently as well. Just a normal development pattern, though. It's all manually done -- i.e., 




It's definitely the pull/push in git. Not knowing my way around git internals at all, I don't know 
(nor do I really want to know, to be honest) which "sub-processes" of `git pull` or `git push` are 
the culprit. Yes, network bandwidth is always a factor, but I guess my expectation is that git 
shouldn't transfer too much more info than the amount of recent changes. For example, if we change 
10 files for a total of 10MB, then my admittedly naive expectation is that git will send that 10MB 
of changes, plus some small constant amount of meta info... not the whole repo every time. No?

--

From: Sverre Rabbelier
Date: Tuesday, May 18, 2010 - 11:54 am

Heya,


I think that's because --aggressive got more aggressive :). We now do
--window=200 and --depth=200 for --aggressive gc's.

-- 
Cheers,

Sverre Rabbelier
--

From: Jeff King
Date: Tuesday, May 18, 2010 - 12:19 pm

I think Sverre is right that this is simply that --aggressive got more
so in the last few versions. But do note that aggressive implies that we
should pack from scratch, not reusing previously found deltas (or
accepting that we didn't find deltas previously).

So you might want "git gc --aggressive" the _first_ time you pack, or
possibly even very occasionally. But if you are packing every day, you
should just use "git gc", which will run much more quickly (and would
probably have acceptable behavior even without the -delta attribute, as
it would only have to look at _new_ objects).

It will have to write the whole 200M packfile out each time, though.
From your timings that looks to take about 50 seconds or so (just
looking at the difference between wall clock time and CPU time, which is
presumably spent in I/O).

Packing nightly won't hurt, but is perhaps excessive. It sounds like you

OK, that is not very big. Once packed, you really should not see

Your assumption is correct. Git should transmit at _worst_ 10MB in such
a scenario (i.e., often much less because of delta compression, but in
your case of apparently-random media files, probably about 10MB).

I wasn't clear from your message: you indicated the changes you made,
but are you still having performance problems, or are you still waiting
to get data?

-Peff
--

From: Nicolas Pitre
Date: Tuesday, May 18, 2010 - 12:33 pm

No.  gc will only create a pack with new loose objects by default.  
Only if the number of packs grows too large will it combine them into one 

Packing nightly with a simple "git gc" i.e. without extra options should 
be perfectly fine.


Nicolas
--

From: Jeff King
Date: Tuesday, May 18, 2010 - 12:41 pm

I think that is only "gc --auto". With regular gc:

  $ git init
  $ echo content >file && git add file && git commit -m one
  $ git gc
  Counting objects: 3, done.
  Writing objects: 100% (3/3), done.
  Total 3 (delta 0), reused 0 (delta 0)
  $ du -a .git/objects/pack
  4  .git/objects/pack/pack-5f6fe4b14529d73f51d7c8efa69306edd35f2302.idx
  4  .git/objects/pack/pack-5f6fe4b14529d73f51d7c8efa69306edd35f2302.pack
  12 .git/objects/pack

  $ echo content >>file && git commit -a -m two
  $ git gc
  Counting objects: 6, done.
  Delta compression using up to 2 threads.
  Compressing objects: 100% (2/2), done.
  Writing objects: 100% (6/6), done.
  Total 6 (delta 0), reused 3 (delta 0)
  $ du -a .git/objects/pack
  4  .git/objects/pack/pack-ecf41a1c120eb911f50fdd2c159e94d5832974f7.idx
  4  .git/objects/pack/pack-ecf41a1c120eb911f50fdd2c159e94d5832974f7.pack
  12 .git/objects/pack

So six objects written in the second gc, and obviously a brand new
single pack.

-Peff
--

From: Nicolas Pitre
Date: Tuesday, May 18, 2010 - 12:59 pm

Argh. You're right.  And "gc --auto" is already run by many commands.

It is "git repack" that doesn't combine packs by default.
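
A small sketch of that difference, assuming a throwaway repository: the
identity values and the helper names commit and pack_count are made up for
the illustration, and the pack count is read from "git count-objects -v".

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config gc.auto 0    # keep automatic gc from repacking behind our back

# Helper: commit everything with a throwaway identity (hypothetical values).
commit() {
  git add . &&
  git -c user.name=Example -c user.email=example@example.com commit -q -m "$1"
}

# Helper: number of packfiles, as reported by "git count-objects -v".
pack_count() {
  git count-objects -v | sed -n 's/^packs: //p'
}

echo one >file && commit one
git repack -q                 # packs the loose objects into a first pack
echo two >>file && commit two
git repack -q                 # new loose objects go into a second pack
packs_incremental=$(pack_count)

git repack -a -d -q           # -a: repack everything; -d: delete old packs
packs_combined=$(pack_count)

echo "packs after plain repack: $packs_incremental"
echo "packs after repack -a -d: $packs_combined"
```

Plain "git repack" only packs what is not yet in a pack, so packs accumulate
until something (such as "git gc" or "git repack -a -d") consolidates them.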


Nicolas
--

From: John
Date: Sunday, May 23, 2010 - 5:21 pm

Just to follow up, the two solutions which have had a noticeable effect are, 
first to run daily `gc`s, and, second, to configure a ".gitattributes" file as such:

*.jpg  binary -delta
*.png  binary -delta
*.psd  binary -delta
*.gz  binary -delta
*.bz2  binary -delta
.. and so on.

On my first go-round with ".gitattributes" (earlier in this thread), my patterns 
were set up incorrectly, as in:

*.{gz,bz2,tgz,psd,png,jpg} binary -delta

Since git does not perform brace expansion, the above patterns never matched. 
After revising the .gitattributes file, a ~6 minute gc dropped down to just 
under ~3 minutes.
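
One way to catch this kind of mistake is to ask git which attributes it
actually applies to a path. A sketch using "git check-attr" in a scratch
repository; photo.jpg is a hypothetical file name and does not need to exist.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Pattern using shell-style braces: gitattributes treats them literally.
echo '*.{gz,bz2,tgz,psd,png,jpg} binary -delta' >.gitattributes
with_braces=$(git check-attr delta -- photo.jpg)

# One pattern per extension, as in the corrected file.
printf '%s binary -delta\n' '*.jpg' '*.png' '*.psd' '*.gz' '*.bz2' >.gitattributes
per_extension=$(git check-attr delta -- photo.jpg)

echo "$with_braces"
echo "$per_extension"
```

With the brace pattern the attribute is reported as "unspecified"; with one
pattern per line it is reported as "unset", which is what "-delta" requests.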

Is there any reason why someone would NOT want the above ".gitattributes" 
defined by default?





--

From: Junio C Hamano
Date: Sunday, May 23, 2010 - 6:16 pm

Other than that our originally intended target audience are people who use
git as a source code control system, not much.

--

From: John
Date: Monday, May 24, 2010 - 12:01 am

Ok, fair enough. It's your project, and you are defining "source control" as 
that which git supports: non-binary, line-by-line text only, C, bash .. no 
images, documents, etc.

I only wish that definition of "source" had been more clear from the get-go.

Perhaps a front and center blurb on the git home page or mission statement might 
clarify things for those of us who have different definitions of "source"?  That 
way, you wouldn't have to be bothered by folks trying to version all their 
project assets with git. For example, you could specify that non-text is out of 
scope for git, (or however you wish to define "source").



--

From: Jeff King
Date: Monday, May 24, 2010 - 11:33 pm

I think both Junio's and my responses were not "we can't do a better job
with non-text sources" but rather "this is how we ended up with the
current state".

I'll admit I have some reservations about trying to figure out a sane
set of extensions for default .gitattributes, but that doesn't mean you

I don't know that we want to explicitly discourage such use. Obviously
certain workflows don't work as well with randomly-changing binary blobs
(e.g., reading format-patch output is next to useless, though it does
still work as a transport if your project relies on emailed patch
submissions).

In general, I think we are happy to take patches making binary storage
more pleasant (e.g., textconv) as long as they don't somehow make the
"normal" case of text worse. There are some things for which git is
simply not well suited (single files in the gigabytes, for example), and
those aren't likely to change because some of the issues are
fundamental to how git works (though there are often workarounds, like
putting gigantic files in their own individual packs). But certainly
100M of jpgs does not seem like an unusable workload to me (as I
mentioned, I have a several-gigabyte photo repository that git does just
fine with).

-Peff
--

From: Michael J Gruber
Date: Tuesday, May 25, 2010 - 12:28 am

and other than that many people use clean/smudge filters to make git
happily and efficiently deltify compressed file formats (such as gz,
bz2, zip) and still keep compressed checkouts...

and other than that which you (plural) and I are not thinking of right now.

Let the defaults be as they are (fit for source control in the proper
sense), it's easy enough to change them for other use cases.
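
A rough sketch of the clean/smudge approach mentioned above, for gzip only.
The filter name "gzfile" and the file notes.gz are made up, and the gzip
options are one possible choice, not a recommendation from this thread.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# "gzfile" is a made-up filter name; the attribute and config must agree on it.
echo '*.gz filter=gzfile' >.gitattributes
git config filter.gzfile.clean  'gzip -dc'   # on "git add": store uncompressed
git config filter.gzfile.smudge 'gzip -nc'   # on checkout: compress again

printf 'hello\n' | gzip -nc >notes.gz
git add .gitattributes notes.gz

# The staged blob holds the uncompressed text, which git can delta normally.
staged=$(git cat-file -p :notes.gz)
echo "$staged"
```

The checked-out file is recompressed by the smudge filter, so it may not be
byte-identical to the file that was originally added, even though the
uncompressed content is the same.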

Michael
--

From: John
Date: Tuesday, May 25, 2010 - 9:12 am

That's fine. We all have different ideas of what revision control means. So long as it's clear what git 
considers "source" and what it considers out of scope, what the defaults are, and what the 
limitations are, potential users can more fairly evaluate git to see if it fits their needs.

For example, code libraries and shell utilities may not require anything more complicated than 
line-by-line text-based patches in revision control.

need not contain any newlines), may present a problem for git in this respect.

Perhaps a section in the manual with a header such as "Handling non-text files", or "Revision 
control for media, XML, and other non line-oriented files" would clear this all up. You could almost 
cull the body of it from this thread and other similar threads.
--

From: Nicolas Pitre
Date: Tuesday, May 25, 2010 - 10:18 am

That is indeed a good idea.

Do you volunteer?


Nicolas
--

From: John
Date: Tuesday, May 25, 2010 - 10:47 am

Yes, of course. If y'all are not up to it, I'd be happy to give it a shot.


--

From: Jeff King
Date: Sunday, May 23, 2010 - 10:39 pm

I delta jpgs in one of my repositories. It is useful if the exif
metadata changes but the image data does not. I assume you could do the
same with other formats which have compressed and uncompressed portions
(I also do it with video containers).  I don't think it would ever make
sense to try to delta gzip'd or bzip'd contents.

I also don't use "binary", as I use a custom diff driver instead (binary
implies "-diff").
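
To illustrate the difference, a sketch comparing the "binary" macro with a
custom diff driver. The driver name "exif" and both file names are
hypothetical, and the driver itself is not configured here; that would be
done separately, for example through a diff.exif.textconv setting pointing at
whatever metadata tool you use.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# *.psd gets the binary macro; *.jpg names a (hypothetical) diff driver
# and is left eligible for delta compression.
cat >.gitattributes <<'EOF'
*.psd binary -delta
*.jpg diff=exif
EOF

psd_diff=$(git check-attr diff -- layout.psd)
jpg_diff=$(git check-attr diff -- photo.jpg)
jpg_delta=$(git check-attr delta -- photo.jpg)

printf '%s\n' "$psd_diff" "$jpg_diff" "$jpg_delta"
```

For the .psd path the diff attribute comes back "unset" because of the macro,
while the .jpg path keeps its named driver and has no delta setting at all.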

As for what should be the default, until now git has never shipped with
any gitattributes defined. This is nice because
it's simple to understand; git doesn't care about filenames unless you
tell it to. The downside obviously is that it may not perform optimally
for some unusual workloads without extra configuration.

We could probably do defaults for some common extensions, but I'm not
really sure where such a thing should end up. For example, I consider
*.psd a uselessly obscure extension, as Adobe doesn't write software for
my platform of choice. Not that I mind having it in git, but rather that
we are inevitably going to miss somebody's pet extension, and then we
are right back where we started with them needing to configure, except
now they also have to figure out which extensions have default
attributes.

-Peff
--

From: John
Date: Sunday, May 23, 2010 - 11:44 pm

I agree, no defaults are better than arbitrary defaults.  So why is the default 
"text"?



--

From: Jeff King
Date: Sunday, May 23, 2010 - 11:45 pm

Because git was designed as a source control system?

-Peff
--
