[RFC] Optimize diff-delta.c

To: <git@...>
Cc: Nicolas Pitre <nico@...>, Martin Koegler <mkoegler@...>
Date: Tuesday, May 1, 2007 - 10:49 am

I am trying to use git with large blobs. Putting such blobs into pack
files is slow and requires a lot of memory, so I took a look at the
packing process.

As the delta format only supports 32 bit offsets, the uncompressed
blob size is limited to 4GB.

The delta index has approximately the same size in memory as the
uncompressed blob: (blob size) / 16 * sizeof(index_entry).
git-pack-objects keeps the delta index of every object in the search
window in memory.

So doing a delta of 4 GB files is totally unrealistic:
 (4GB data + ~4GB index)* window size [default: 10]= 80 GB in the worst case
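
The estimate above can be written out as a small sketch. The function
names are made up for illustration, and entry_size stands in for
sizeof(index_entry); a value of 16 bytes is an assumption chosen to match
the "index is about as large as the blob" approximation, not a number
taken from diff-delta.c.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption from the estimate above: one index entry per 16 bytes of
 * source data. */
#define BYTES_PER_ENTRY 16u

/* Approximate size in bytes of the delta index for one blob.
 * entry_size stands in for sizeof(index_entry). */
uint64_t estimate_index_bytes(uint64_t blob_size, uint64_t entry_size)
{
    return blob_size / BYTES_PER_ENTRY * entry_size;
}

/* Worst case: every object in the search window keeps both its
 * uncompressed data and its delta index in memory. */
uint64_t estimate_window_bytes(uint64_t blob_size, uint64_t entry_size,
                               unsigned window)
{
    assert(window > 0);
    return (blob_size + estimate_index_bytes(blob_size, entry_size)) * window;
}
```

With blob_size = 4 GB, entry_size = 16 and window = 10, this gives
(4 GB + 4 GB) * 10 = 80 GB, the worst case quoted above.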

In my case, the blobs are a few hundred MB each, but git-pack-objects
already uses several GB of memory. As the memory requirement of
git-pack-objects is currently below the available memory of my system,
I do not need to address this issue yet.

In the future, I'll probably need to create a patch that frees big delta
indexes in find_delta immediately after create_delta returns. This
will increase the processing time, but that is better than not being
able to pack objects at all.

I tried to speed up the delta generation by searching for a common
prefix, as my blobs are mostly append-only. I tested it with slightly
fewer than 1000 big blobs. The time for finding the deltas decreased
from 17 to 14 minutes of CPU time.
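
The core of a common-prefix search might look roughly like the sketch
below. This is only an illustration, not the code from the patch:
common_prefix_len is a hypothetical helper name, and how the resulting
length is then used when generating the delta is not shown here.

```c
#include <stddef.h>

/* Hypothetical helper (not from the patch): return the number of
 * leading bytes that the two buffers have in common. */
size_t common_prefix_len(const unsigned char *a, size_t a_len,
                         const unsigned char *b, size_t b_len)
{
    /* The prefix can be no longer than the shorter buffer. */
    size_t max = a_len < b_len ? a_len : b_len;
    size_t n = 0;

    /* Advance while both buffers agree byte for byte. */
    while (n < max && a[n] == b[n])
        n++;
    return n;
}
```

For an append-only blob, the old version is a prefix of the new one, so
the function returns the full length of the old version and only the
appended tail is left to be matched by other means.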

When repacking the git repository itself, I get the following numbers:

Unmodified version (gcc-4.1):
$ echo | time ./git-pack-objects --non-empty --all --reflog --unpacked=pack-d44dc76d0e873a7c7566bcc4503731b9a5640b30.pack --no-reuse-delta  .git/.tmp-28449-pack
Generating pack...
Done counting 42553 objects.
Deltifying 42553 objects...
 100% (42553/42553) done
Writing 42553 objects...
 100% (42553/42553) done
d44dc76d0e873a7c7566bcc4503731b9a5640b30
Total 42553 (delta 29605), reused 12346 (delta 0)
63.82user 0.80system 1:06.64elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+29177minor)pagefaults 0swaps

Patched version (gcc-4.1):
$ echo | time ../git/git-pack-objects --non-empty --all --reflog --unpacked=pack-d44dc76d0e873a7c7566bcc4503731b9a5640b30.pack --no-reuse-delta  .git/.tmp-28448-pack
Generating pack...
Done counting 42553 objects.
Deltifying 42553 objects...
 100% (42553/42553) done
Writing 42553 objects...
 100% (42553/42553) done
d44dc76d0e873a7c7566bcc4503731b9a5640b30
Total 42553 (delta 29581), reused 12353 (delta 0)
62.13user 0.91system 1:05.07elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+39690minor)pagefaults 0swaps

So the patch can also improve performance slightly on normal
repositories (62.13+0.91 vs. 63.82+0.80 seconds of user+system time).

The following patch is for testing purposes only and has not been
cleaned up.

Regards, Martin K

Messages in current thread:
[RFC] Optimize diff-delta.c, Martin Koegler, (Tue May 1, 10:49 am)
Re: [RFC] Optimize diff-delta.c, Nicolas Pitre, (Tue May 1, 12:05 pm)
Re: [RFC] Optimize diff-delta.c, Johannes Schindelin, (Tue May 1, 11:51 am)