Re: Decompression speed: zip vs lzo

To: Git Mailing List <git@...>
Date: Wednesday, January 9, 2008 - 6:01 pm

I have created a big tar from linux tree:

linux-2.6.tar   300,0 MB

Then I have created two compressed files with the zip and lzop utilities (the
latter uses the lzo compression algorithm):

linux-2.6.zip  70,1 MB

linux-2.6.tar.lzo  108,0 MB

Then I have tested the decompression speed:

$ time unzip -p linux-2.6.zip > /dev/null
3.95user 0.09system 0:04.05elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+189minor)pagefaults 0swaps

$ time lzop -d -c linux-2.6.tar.lzo > /dev/null
2.10user 0.07system 0:02.18elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+234minor)pagefaults 0swaps


So the bottom line is that lzo decompression is almost twice as fast as zip.
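
As a quick check of the figures above, here is a minimal C sketch that turns the reported sizes and elapsed times into ratios. The function names are made up for illustration, and using 300.0 MB as the output size of both runs is an approximation (unzip -p emits the file contents, lzop the tar itself).

```c
#include <assert.h>

/* Ratio of two positive quantities, e.g. elapsed times or file sizes. */
double ratio(double a, double b)
{
	assert(b > 0.0);
	return a / b;
}

/* Throughput in MB/s for `size_mb` megabytes produced in `seconds`. */
double throughput_mb_per_s(double size_mb, double seconds)
{
	assert(seconds > 0.0);
	return size_mb / seconds;
}
```

With the numbers from this message, ratio(4.05, 2.18) is about 1.86 (the lzo speedup) and ratio(108.0, 70.1) is about 1.54 (the lzo file is roughly 54% larger than the zip). throughput_mb_per_s(300.0, 4.05) is about 74 MB/s for unzip against about 138 MB/s for lzop.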


Marco

P.S.: Compressed size is better for zip, but a more realistic test
would be to try a delta-packed repo instead of a simple tar of
source files. Because a delta-packed repo is already compressed in its own
way, perhaps the difference in final file sizes is smaller.
-
To: Marco Costalba <mcostalba@...>
Cc: Git Mailing List <git@...>
Date: Wednesday, January 9, 2008 - 6:55 pm

Note that neither the space nor the time performance of compressing
and uncompressing a single huge blob is as interesting in the
context of git as compressing/uncompressing millions of small
pieces whose total size is comparable to the specimen of the "huge
single blob" experiment.  Obviously loose object files are
compressed individually, and packfile contents are also
individually and independently compressed.  Set-up cost for
individual invocation of compression and uncompression on
smaller data matters a lot more than an experiment on
compressing and uncompressing a single huge blob (this applies
to both time and space).
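
The set-up cost argument can be illustrated with a toy cost model in C. This is only a sketch: the function names and every parameter value mentioned in the note after the code are assumptions chosen for illustration, not measurements of zlib or lzo.

```c
#include <assert.h>

/*
 * Toy model: total time = per-call set-up cost * number of calls
 *                         + total bytes / streaming throughput.
 * All inputs are hypothetical; nothing here is measured.
 */
double total_time_us(unsigned long n_calls, double setup_us,
		     double total_bytes, double bytes_per_us)
{
	assert(bytes_per_us > 0.0);
	return (double)n_calls * setup_us + total_bytes / bytes_per_us;
}

/* Fraction (0..1) of the total time spent purely on per-call set-up. */
double setup_share(unsigned long n_calls, double setup_us,
		   double total_bytes, double bytes_per_us)
{
	double total = total_time_us(n_calls, setup_us, total_bytes, bytes_per_us);

	return total > 0.0 ? (double)n_calls * setup_us / total : 0.0;
}
```

Assuming 300e6 bytes, 100 bytes per microsecond and 1 microsecond of set-up per call, one call on a single blob spends almost nothing on set-up, while one million calls on small pieces spend a quarter of the total time on it (setup_share(1000000, 1.0, 300e6, 100.0) is 0.25). A single-blob benchmark cannot see that term at all.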


-
To: Marco Costalba <mcostalba@...>
Cc: Git Mailing List <git@...>, Junio C Hamano <gitster@...>
Date: Wednesday, January 9, 2008 - 7:23 pm

Yes - and lzo will almost certainly win on all those counts!

I think to go forward this would need a prototype and benchmark figures
for things like "annotate" and "fsck --full" - but bear in mind it would
be a long road to follow-up to completion, as repository compatibility
would need to be a primary concern and this essentially would create a
new pack type AND a new *object* type.  Not only that, but currently
there is no header in the objects on disk which can be used to detect a
gzip vs. an lzop stream.  Not really worth it IMHO - gzip is already
fast enough on even the most modern processor these days.

Sam.
-
To: Sam Vilain <sam@...>
Cc: Marco Costalba <mcostalba@...>, Git Mailing List <git@...>
Date: Wednesday, January 9, 2008 - 7:49 pm

For the compression type detection, I was hoping that we could
do something like sha1_file.c::legacy_loose_object(), but I tend
to agree it is probably not worth it.


-
To: Sam Vilain <sam@...>
Cc: Marco Costalba <mcostalba@...>, Git Mailing List <git@...>, Junio C Hamano <gitster@...>
Date: Wednesday, January 9, 2008 - 7:31 pm

Hi,


No new object type.  Why should it?  But it has to have a config variable 
which says what type of packs/loose objects it has (and you will not be 

I agree that gzip is already fast enough.

However, pack v4 had more goodies than just being faster; it also promised 
to have smaller packs.  And pack v4 would need to have the same 
infrastructure of repacking if the client does not understand v4 packs.

Ciao,
Dscho

-
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Sam Vilain <sam@...>, Marco Costalba <mcostalba@...>, Git Mailing List <git@...>, Junio C Hamano <gitster@...>
Date: Wednesday, January 9, 2008 - 11:41 pm

Right, like not having to compress tree objects and half of commit 
objects at all.


Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Sam Vilain <sam@...>, Git Mailing List <git@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 2:55 am

Decompression speed has been shown to be a bottleneck on some tests
involving mainly 'git log'.

Regarding backward compatibility, I really don't know at what level git
functions actually need to know the compression format. Looking at the
code I would say at a very low level: the functions that deal directly with
inflate() and friends are few [1] and not directly connected to the UI
or to git config. Is this compression format something the user should
know or care about? And if yes, why?

In my tests the assumption of a source-file tarball is unrealistic.
To test the final size difference I would like to try different
compressions on a big file that is already packed but not yet zipped.
Could someone be so kind as to hint at how to create such a package
with good quality, i.e. with packing levels similar to what is done
for public repos?

This does not realistically test speed because, as Junio pointed out,
the real decompression scheme is different: many calls on small
objects, not one call on a big one. But if the final size is acceptable we
can go on to more difficult tests.

Marco

[1] where inflate() is called:

-inflate_it() in builtin-apply.c
-check_pack_inflate() in builtin-pack-objects.c
-get_data() in builtin-unpack-objects.c
-fwrite_sha1_file() in http-push.c and http-walker.c  [mmm interesting
same function in two files, also the signature and the contents seem
the same....]
-unpack_entry_data() in index-pack.c
-unpack_sha1_header(), unpack_sha1_rest(), get_size_from_delta(),
unpack_compressed_entry(), write_sha1_from_fd() in sha1_file.c
-
To: Marco Costalba <mcostalba@...>
Cc: Nicolas Pitre <nico@...>, Johannes Schindelin <Johannes.Schindelin@...>, Sam Vilain <sam@...>, Git Mailing List <git@...>, Junio C Hamano <gitster@...>, <danahow@...>
Date: Thursday, January 10, 2008 - 3:34 pm

Thanks for looking into this,  in this email and your subsequent ones.

I agree that zip time is an issue.  I was looking into reducing the _number_
of zip calls on the same data,  but work and personal crises have reduced

The approach you're taking (here and in following emails) of being
able to make zip/lzo selection and measure the results should be
enlightening.  For the vast majority of git users,  Junio's scenario
is the most relevant.

Of additional interest to me is handling enormous objects more quickly.
I would like to replace some p4 usage here with git,  but most users will
only notice the speed difference and not use git's extra features.  Thus
they will compare git add/git commit/git push unfavorably to p4 edit/p4 submit,
because the former effectively does zip/unzip/zip/send,  while the latter
only does zip/send (git's extra "unzip/zip" comes from loose objects not
being directly copyable into packs).  This speed difference is irrelevant
for small to normal files,  but a killer when committing a collection of say
100MB files.

Your lzo option could reduce this performance degradation vs p4 from
3x to close to 1.5x.  If you get it accepted,  I'd love to then "fix" the loose
object copying "problem" making git _faster_ than p4 on large files!
2 simple forms for this "fix" would be to use the once-and-future "new"
loose object format (an idea already rejected),  or to encode all loose
objects as singleton packs under .git/objects/xx (so that all (re)packing,
in the absence of new deltification,  becomes pack-to-pack copying).
This latter idea is a modification of an idea from Nicolas Pitre.
It certainly adds less code than other approaches for such a "fix".

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell
-
To: Nicolas Pitre <nico@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Sam Vilain <sam@...>, Git Mailing List <git@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 7:45 am

Looking at the git sources I have found that the zip routines are
candidates for a cleanup; for example, more or less the same
lines of code are repeated many times in git files:

memset(&stream, 0, sizeof(stream));
deflateInit(&stream, pack_compression_level);
maxsize = deflateBound(&stream, size);
out = xmalloc(maxsize);
stream.next_out = out;
stream.avail_out = maxsize;


So what I'm planning to do to test different algorithms is first
a cleanup, more or less as follows:

- Remove #include <zlib.h> from cache.h and substitute it with #include
"compress.h"

- Add #include <zlib.h> where it is "really" intended, for example archive-zip.c

- Rename inflate()/deflate() and other zlib calls to the corresponding
  zlib_inflate()
  zlib_deflate()

declared in compress.h

- Define zlib_inflate() and friends as simple wrappers around the
corresponding zlib functions

- Test that everything is OK (it should be only code shuffling/renaming until now)

- Start cleaning up, for example by adding a do_deflateInit() that wraps
all the code I have reported above that involves deflateInit()

- When the compression routines are cleaned up, add new functions
do_inflate() and do_deflate(), replacing the zlib_* ones, that wrap the
compression algorithm dispatching logic.

Dispatching could be chosen in different ways:

- compile time (at #define level)
- config (some configuration value stored in some global variable)
- dynamic (at run time, with no configuration needed, I have some
ideas on this ;-)
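
One way the "config" style of dispatching could look is a table of function pointers. The following is a minimal sketch under stated assumptions, not code from git: struct compressor, find_compressor() and the signatures of do_deflate()/do_inflate() are hypothetical, and a toy run-length coder stands in for a real zlib or lzo backend so that the example has no library dependencies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical backend interface. Each function writes at most `cap`
 * bytes to `out` and returns the number of bytes written, or 0 on
 * failure (so empty input is not distinguishable from an error here).
 */
struct compressor {
	const char *name;
	size_t (*deflate)(const unsigned char *in, size_t len,
			  unsigned char *out, size_t cap);
	size_t (*inflate)(const unsigned char *in, size_t len,
			  unsigned char *out, size_t cap);
};

/* "NULL compressor": stores the data unchanged in both directions. */
static size_t null_copy(const unsigned char *in, size_t len,
			unsigned char *out, size_t cap)
{
	if (len > cap)
		return 0;
	memcpy(out, in, len);
	return len;
}

/* Toy run-length encoder: emits (count, byte) pairs, count in 1..255. */
static size_t rle_deflate(const unsigned char *in, size_t len,
			  unsigned char *out, size_t cap)
{
	size_t i = 0, o = 0;

	while (i < len) {
		size_t run = 1;

		while (i + run < len && in[i + run] == in[i] && run < 255)
			run++;
		if (o + 2 > cap)
			return 0;
		out[o++] = (unsigned char)run;
		out[o++] = in[i];
		i += run;
	}
	return o;
}

/* Inverse of rle_deflate(); rejects odd lengths and zero counts. */
static size_t rle_inflate(const unsigned char *in, size_t len,
			  unsigned char *out, size_t cap)
{
	size_t i, o = 0;

	if (len % 2)
		return 0;
	for (i = 0; i < len; i += 2) {
		size_t run = in[i];

		if (run == 0 || o + run > cap)
			return 0;
		memset(out + o, in[i + 1], run);
		o += run;
	}
	return o;
}

static const struct compressor backends[] = {
	{ "null", null_copy, null_copy },
	{ "rle", rle_deflate, rle_inflate },
};

/* Run-time dispatch: look a backend up by its configured name. */
const struct compressor *find_compressor(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(backends) / sizeof(backends[0]); i++)
		if (!strcmp(backends[i].name, name))
			return &backends[i];
	return NULL;
}

/* Callers only see do_deflate()/do_inflate(), never a backend directly. */
size_t do_deflate(const struct compressor *c, const unsigned char *in,
		  size_t len, unsigned char *out, size_t cap)
{
	assert(c != NULL);
	return c->deflate(in, len, out, cap);
}

size_t do_inflate(const struct compressor *c, const unsigned char *in,
		  size_t len, unsigned char *out, size_t cap)
{
	assert(c != NULL);
	return c->inflate(in, len, out, cap);
}
```

A caller would pick a backend once, e.g. find_compressor("rle"), and pass it to do_deflate() and do_inflate(). The compile-time variant would replace the lookup with a #define, and the dynamic variant would need some marker in the stored data to select the backend when reading.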


Comments?

Thanks
Marco
-
To: Marco Costalba <mcostalba@...>
Cc: Nicolas Pitre <nico@...>, Sam Vilain <sam@...>, Git Mailing List <git@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 8:12 am

Hi,


No.  We will always need zlib for compatibility.  You cannot just replace 

We have a long tradition to have the system includes in cache.h.

Besides, if you have "compress.h" included in cache.h, which in turn has 
to include "zlib.h", what is the use of putting it also in archive-zip.c?

Ciao,
Dscho

-
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Nicolas Pitre <nico@...>, Sam Vilain <sam@...>, Git Mailing List <git@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 8:18 am

Ok. This was just to check what is broken by removing zlib.h so that
I'm sure to rename all the zlib related stuff.

But I agree this is mostly a development detail and I can do this just
in my private tree to help me hack on the patches.

Thanks
Marco
-
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Git Mailing List <git@...>
Date: Wednesday, January 9, 2008 - 9:02 pm

I meant loose object.  However this is configured, it affects things
like HTTP push/pull.  Configuring like that would be a bit too fragile

Indeed - I think it would be a lot easier to implement if it didn't
bother with loose objects.  It can just be a new pack version with more
compression formats.  For when you know you're going to be doing a lot
of analysis you'd already run "git-repack -a -f" to shorten the deltas,
so this might be a useful option for some - but again I'd want to see
figures first.

I do really like LZOP as far as compression algorithms go.  It seems a
lot faster for not a huge loss in ratio.

Sam.
-
To: Git Mailing List <git@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 1:02 am

Coincidentally, I read this today on an algorithm (LZMA - same as 7zip)
which is very slow to compress, high ratio but quick decompression:

  http://use.perl.org/~acme/journal/35330

Which sounds excellent for squeezing those "archive packs" into even
more ridiculously tiny spaces.

Sam.
-
To: Sam Vilain <sam@...>
Cc: Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 5:16 am

Well, lzma is excellent for *big* chunks of data, but not that impressive for
small files:

$ ll git.c git.c.gz git.c.lzma git.c.lzop
-rw-r--r-- 1 madcoder madcoder 12915 2008-01-09 13:47 git.c
-rw-r--r-- 1 madcoder madcoder  4225 2008-01-10 10:00 git.c.gz
-rw-r--r-- 1 madcoder madcoder  4094 2008-01-10 10:00 git.c.lzma
-rw-r--r-- 1 madcoder madcoder  5068 2008-01-10 09:59 git.c.lzop


And lzma performs really badly if you have little memory available. The "big" secret
of lzma is that it basically works with a huge window to check for repetitive
data, and even decompression needs quite a fair amount of memory, making it a
really bad choice for git IMNSHO.

Though I don't agree with you (and some others) about the fact that gzip is
fast enough. It's clearly a bottleneck in many log related commands where you
would expect it to be rather IO bound than CPU bound.  LZO seems like a fairer
choice, especially since where it gains is basically the compression of the
biggest blobs, aka the delta chain heads. It's really unclear to me if we
really gain by compressing the deltas, trees, and other smallish pieces of
information.

And when it comes to times, for a file big enough to give numbers, here are the
decompression times (best of 10 runs, smaller is better, second number is the
size of the packed data, original data was 7.8Mo):
  * lzma: 0.374s (2.2Mo)
  * gzip: 0.127s (2.9Mo)
  * lzop: 0.053s (3.2Mo)

For a 300k original file:
  * lzma: 0.022s (124Ko)
  * gzip: 0.008s (144Ko)
  * lzop: 0.004s (156Ko) /* most of the samples were actually 0.005 */

What is obvious to me is that lzop seems to take 10% more space than gzip,
while being around 1.5 to 2 times faster. Of course this is very sketchy and a
real test with git will be better.
-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org
To: Pierre Habouzit <madcoder@...>
Cc: Sam Vilain <sam@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 4:39 pm

This is really the big point here.  Git uses _lots_ of *small* objects, 
usually much smaller than 12KB.  For example, my copy of the gcc 
repository has an average of 270 _bytes_ per compressed object, and 
objects must be individually compressed.

Performance with really small objects should be the basis for any 

The delta heads, though, are far from being the most frequently accessed 
objects.  First they're clearly in minority, and often cached in the 

Remember that delta objects represent the vast majority of all objects. 
For example, my kernel repo currently has 555015 delta objects out of 
677073 objects, or 82% of the total.  There are actually only 25869
non-deltified blob objects which are likely to be the larger objects, but
they represent only 4% of the total.

But let's just try not compressing delta objects, to check your
assertion with the following hack:

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index a39cb82..252b03e 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -433,7 +433,10 @@ static unsigned long write_object(struct sha1file *f,
 		}
 		/* compress the data to store and put compressed length in datalen */
 		memset(&stream, 0, sizeof(stream));
-		deflateInit(&stream, pack_compression_level);
+		if (obj_type == OBJ_REF_DELTA || obj_type == OBJ_OFS_DELTA)
+			deflateInit(&stream, 0);
+		else
+			deflateInit(&stream, pack_compression_level);
 		maxsize = deflateBound(&stream, size);
 		out = xmalloc(maxsize);
 		/* Compress it */

You then only need to run 'git repack -a -f -d' with and without the 
above patch.

Here are my rather surprising results:

My kernel repo pack size without the patch:	184275401 bytes
Same repo with the above patch applied:		205204930 bytes

So it is only 11% larger.  I was expecting much more.
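
For reference, the percentages quoted in this message can be recomputed from the raw counts with a few lines of C. This is just an arithmetic sketch; the helper names are invented for illustration.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent. */
double percent_change(double before, double after)
{
	assert(before > 0.0);
	return (after - before) * 100.0 / before;
}

/* Share of `part` in `total`, in percent. */
double percent_of(double part, double total)
{
	assert(total > 0.0);
	return part * 100.0 / total;
}
```

percent_change(184275401, 205204930) gives about 11.4 (pack growth with uncompressed deltas), percent_of(555015, 677073) about 82.0 (delta objects) and percent_of(25869, 677073) about 3.8 (non-deltified blobs), matching the rounded figures above.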


Right.  Abstracting the zlib code and having different compression 
algorithms tested in the Git context is the only way to do meaningful 
comparisons.


Ni...
To: Nicolas Pitre <nico@...>
Cc: Pierre Habouzit <madcoder@...>, Sam Vilain <sam@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Friday, January 11, 2008 - 10:18 am

If it so happens that one algorithm does much better on small objects
while another does better on large objects, there really is nothing that
prevents using both in a repository.  It's a bit of code bloat, of course.

Morten
-
To: Nicolas Pitre <nico@...>
Cc: Sam Vilain <sam@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Friday, January 11, 2008 - 5:45 am

Using as a PoC a test that is if (size <= 512) instead, I get:

vanilla git:

$ du -k .git/**/*.pack
180808 .git/objects/pack/pack-7bc9f383c92cbffe366da2d2a62b67bb33a53365.pack
$ repeat 5 time git blame MAINTAINERS >|/dev/null
git blame MAINTAINERS >| /dev/null  7,34s user 0,09s system 99% cpu 7,433 total
git blame MAINTAINERS >| /dev/null  7,31s user 0,16s system 100% cpu 7,475 total
git blame MAINTAINERS >| /dev/null  7,35s user 0,08s system 100% cpu 7,431 total
git blame MAINTAINERS >| /dev/null  7,30s user 0,18s system 99% cpu 7,482 total
git blame MAINTAINERS >| /dev/null  7,33s user 0,16s system 99% cpu 7,492 total


With compression disabled for sizes <= 512:

$ du -k .git/**/*.pack
188840 .git/objects/pack/pack-7bc9f383c92cbffe366da2d2a62b67bb33a53365.pack
$ repeat 5 time git blame MAINTAINERS >|/dev/null
git blame MAINTAINERS >| /dev/null  7,06s user 0,09s system 100% cpu 7,150 total
git blame MAINTAINERS >| /dev/null  7,08s user 0,13s system 99% cpu 7,209 total
git blame MAINTAINERS >| /dev/null  7,07s user 0,08s system 99% cpu 7,168 total
git blame MAINTAINERS >| /dev/null  7,02s user 0,15s system 99% cpu 7,177 total
git blame MAINTAINERS >| /dev/null  7,07s user 0,13s system 99% cpu 7,243 total

Okay, the size doesn't even budge, it's not even near being fun. Though
we gain 3% of wall clock time


Let's try with a limit of 1024 then !

$ du -k .git/**/*.pack
201725	.git/objects/pack/pack-7bc9f383c92cbffe366da2d2a62b67bb33a53365.pack
$ repeat 5 time git blame MAINTAINERS >|/dev/null
git blame MAINTAINERS >| /dev/null  6,93s user 0,16s system 77% cpu 9,109 total
git blame MAINTAINERS >| /dev/null  6,88s user 0,08s system 99% cpu 6,965 total
git blame MAINTAINERS >| /dev/null  6,84s user 0,10s system 99% cpu 6,952 total
git blame MAINTAINERS >| /dev/null  6,86s user 0,12s system 99% cpu 6,983 total
git blame MAINTAINERS >| /dev/null  6,81s user 0,18s system ...
To: Pierre Habouzit <madcoder@...>
Cc: Sam Vilain <sam@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Friday, January 11, 2008 - 10:27 am

Well, sorry but that doesn't count to me.  The whole 'git log' taking 
around 2 seconds is already hell fast for what it does, and IMHO this is 
not worth increasing the repository storage size for this particular 

If that was 43% reduction of a 10 second operation then sure I would 
agree, like the blame operation typically is.  But otherwise the 
significant storage size increase is not worth the reduction of less 

No, I doubt it would.  The bulk of 'git gc --auto' will reuse existing 

Well, I was initially enthusiastic about this avenue, but the speed 
performance difference is far from impressive IMHO, given the tradeoff.


Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: Pierre Habouzit <madcoder@...>, Sam Vilain <sam@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 5:51 pm

The first thing I would like to test when the zlib abstraction is ready is
a NULL compressor, i.e. no compression/decompression at
all, to see if 'git log' and friends are happy.

BTW would it be possible to test git with zlib disabled right now? I mean,
is there a quick hack to disable zlib not only in writing but also in
reading, so that we can see what happens when running a repository
packed without compression?


Thanks
Marco
-
To: Marco Costalba <mcostalba@...>
Cc: Pierre Habouzit <madcoder@...>, Sam Vilain <sam@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 6:18 pm

Easy: git config core.compression 0


Nicolas
-
To: Marco Costalba <mcostalba@...>
Cc: Nicolas Pitre <nico@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 6:01 pm

See Nicolas Pitre's hack on another branch of this thread - it won't
cut out zlib entirely, but at least it's just configuring it to do plain
pass-through.  You can probably just replace pack_compression_level with 0.

Sam
-
To: Nicolas Pitre <nico@...>
Cc: Pierre Habouzit <madcoder@...>, Sam Vilain <sam@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 5:01 pm

It's probably worth doing those statistics on some other projects.

The kernel has for the last five+ years very much encouraged people to 
make series of small changes, so I would not be surprised if it turns out 
that the deltas for the kernel are smaller than average, if only because 
the whole development process has encouraged people to send in a series of 
ten patches rather than a single larger one.

And there are basically *no* generated files in the kernel source repo.

Maybe the difference to other repositories isn't huge, and maybe the 
kernel *is* a good test-case, but I just wouldn't take that for granted. 

Yes, deltas are bound to compress much less well than non-deltas, and 
especially for tree objects (which is a large chunk of them) they probably 
compress even less (because a big part of the delta is actually just the 
SHA1 changes), but if it's 11% on the kernel, it could easily be 25% on 
something else.

Try with the gcc repo, especially the one that has deep delta chains (so 
it has even *more* deltas in relation to full objects than the kernel has)

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Nicolas Pitre <nico@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 5:45 pm

For reference, 20 years of Perl with very deep deltas:

wilber:~/src/perl-preview$ du -sk .git
73274   .git
wilber:~/src/perl-preview$ git-repack -a
Counting objects: 244360, done.
Compressing objects: 100% (55493/55493), done.
Writing objects: 100% (244360/244360), done.
Total 244360 (delta 181061), reused 244360 (delta 181061)
wilber:~/src/perl-preview$ du -sk .git/objects/pack/
75389   .git/objects/pack/
wilber:~/src/perl-preview$

There are a few generated files in this history, but really only yacc
files etc.  It is in general also a lot of small changes.

Sam.
-
To: Sam Vilain <sam@...>
Cc: Nicolas Pitre <nico@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 6:03 pm

Hmm. I'm not sure I understand what this was supposed to show?

You reused all the old deltas, and you did "du -sk" on two different 
things before/after (and didn't do a "-a -d" to repack the old pack 
either). So does the result actually have anything to do with any 
compression algorithm?

Use "-a -d -f" to repack a whole archive.

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Nicolas Pitre <nico@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 6:28 pm

Drat, guess that means I'll have to recompute the deltas - I was trying
to avoid that.

Ok, see you in an hour or two, hopefully sans bonehead mistakes this time :)

Sam.
-
To: Sam Vilain <sam@...>
Cc: Nicolas Pitre <nico@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 6:56 pm

Well, you could try to reuse the delta base information itself, but then 
recompute the actual delta data contents. It would require some 
source-code changes, but that may be faster (and result in a more accurate 
before/after picture) than actually recomputing the deltas.

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Nicolas Pitre <nico@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 9:01 pm

Yes, it would - but my runs have finished.

Without compression of deltas:

wilber:~/src/perl-preview$ git-repack -a -d -f --window=250 --depth=100
Compressing objects: 100% (236554/236554), done.
Writing objects: 100% (244360/244360), done.
Total 244360 (delta 182343), reused 0 (delta 0)
wilber:~/src/perl-preview$ du -sk .git/objects/pack/
86781   .git/objects/pack/

With compression of deltas:

wilber:~/src/perl-preview$ time git-repack -a -d -f --window=250 --depth=100
Counting objects: 244360, done.
Compressing objects: 100% (236554/236554), done.
Writing objects: 100% (244360/244360), done.
Total 244360 (delta 182343), reused 0 (delta 0)

real    20m34.985s
user    20m1.003s
sys     0m25.558s
wilber:~/src/perl-preview$ du -sk .git/objects/pack/
72907   .git/objects/pack/

wilber:~/src/perl-preview$ git --version
git version 1.5.4.rc2.7.g079c9-dirty

Of course those compression parameters are quite insane.

And as a side note either repack-objects got significantly better about
memory use between 1.5.3.5 and that version (the OOM killer fired -
killing first firefox and thunderbird :)) or apparently running
git-repack with a ulimit stops it from allocating too much VM.

Sam.
-
To: Sam Vilain <sam@...>
Cc: Nicolas Pitre <nico@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 10:10 pm

Ok, so non-compressed deltas are 20% bigger.

That may well be a perfectly acceptable trade-off if the end result is 
then a lot faster. Has somebody done performance numbers? I may have 
missed them.. The best test is probably something like "git blame" on a 
file that takes an appreciable amount of time.

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Nicolas Pitre <nico@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Friday, January 11, 2008 - 2:29 am

The difference seems only barely measurable;

wilber:~/src/perl-preview$ time git annotate sv.c >/dev/null

real    0m8.130s
user    0m6.712s
sys     0m1.412s

wilber:~/src/perl-preview-loose$ time git annotate sv.c >/dev/null

real    0m7.930s
user    0m6.480s
sys     0m1.408s

(each one is last of three runs - dual-core x86_64 @ 2.1GHz w/512KB cache)

sv.c has about 1500 revisions, though the oldest line is    I also tried
annotate and log on the YACC generated parser which only has about 165
revisions, with similar results - a very minor difference or no difference.

Sam
-
To: Sam Vilain <sam@...>
Cc: Nicolas Pitre <nico@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Friday, January 11, 2008 - 12:03 pm

[Empty message]
To: Linus Torvalds <torvalds@...>
Cc: Nicolas Pitre <nico@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Friday, January 11, 2008 - 9:52 pm

Well, my figures agree with Pierre I think - 6-10% time savings for
'git annotate'.

I think Pierre has hit the nail on the head - that skipping
compression for small objects is a clear win.  He saw the obvious
criterion, really.  I've knocked it up as a config option that doesn't
change the default behaviour below.

I can't help but speculate what benefits having a range of one or two
of the most elite compression algorithms (eg, lzop or even lzma for
the larger blobs) available would be, in general.  eg, if gzip takes a
stream longer than X kb to offer substantial benefits over lzop, lzop
the ones shorter than that.

If the uncompressed objects are clustered in the pack, then they might
stream compress a lot better, should they be transmitted over a http
transport with gzip encoding.  In packs which should be as small as
possible, with a format change they could be distributed as one
compressed resource.  The ordering of the objects would ideally be
selected such that it results in optimum compression - which could add
a savings akin to bzip2 vs gzip, at the expense of having to scan the
small objects for mini-deltas and arrange them clustering objects
which share these mini-deltas.

Well, interesting ideas anyway :)

Subject: [PATCH] pack-objects: add compressionMinSize option

Objects smaller than a page don't save much space when compressed, and
cause some overhead.  Allow the user to specify a minimum size for
objects before they are compressed.

Credit: Pierre Habouzit <madcoder@debian.org>
Signed-off-by: Sam Vilain <sam.vilain@catalyst.net.nz>
---
 Documentation/config.txt |    5 +++++
 builtin-pack-objects.c   |    7 ++++++-
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1b6d6d6..245121e 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -734,6 +734,11 @@ pack.compression::
 	compromise between speed and compression (currently equivalent
 	to level 6)....
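
The patch above is cut off in this archive, so its actual code is not shown. As a rough sketch of the idea its commit message describes (and not the patch itself), the decision could be reduced to choosing a zlib level per object. level_for_object() and its parameters are hypothetical names, and treating "smaller than min_size" as a strict comparison is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pick the zlib compression level for one object.
 * `min_size` stands in for the proposed compressionMinSize setting;
 * 0 means "always compress". Level 0 makes zlib store the data
 * uncompressed; -1 is zlib's default level.
 */
int level_for_object(size_t size, size_t min_size, int configured_level)
{
	assert(configured_level >= -1 && configured_level <= 9);
	if (size < min_size)
		return 0;	/* too small to be worth compressing */
	return configured_level;
}
```

With a hypothetical threshold of 4096 bytes, a 270-byte object would get level 0 while a 12915-byte one would keep the configured level. The returned value would then be what gets passed to deflateInit().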
To: Sam Vilain <sam@...>
Cc: Linus Torvalds <torvalds@...>, Nicolas Pitre <nico@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>
Date: Saturday, January 12, 2008 - 12:46 am

That would only have been a sensible optimization in older
native pack protocol, where we always exploded the transferred
packfile.  However, these days, we tend to keep the packfile and
re-index at the receiving end (http transport never exploded the
packfile and it still doesn't).  When used that way, choosing
object layout in packfile in such a way to ignore recency order
and cluster objects by their delta chain, which you are
advocating to reduce the transfer overhead, is a bad tradeoff.
Your packs will be kept in the form you chose for transport,
which is a layout that hurts the runtime performance.  And you
keep using those suboptimal packs a number of times, getting hurt

I very much like the simplicity of the patch.  If such a simple
approach can give us a clear performance gain, I am all for it.

Benchmarks on different repositories need to back that up,
though.
-
To: Sam Vilain <sam@...>
Cc: Linus Torvalds <torvalds@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Friday, January 11, 2008 - 10:32 pm

Sorry to rain on your parade, but to me 6-10% time saving is not a clear 
win at all, given the equal increase in repository size.  This is simply 
not worth it.

And a 50% time saving on an operation such as git log, which takes less 
than 2 seconds in absolute time, is not worth the repo size increase 
either.  Going from 2 seconds down to one second doesn't make enough of 
a user experience difference.

If git blame were to go from 10 seconds down to 4 then I'd say this is a 
clear win.  But this is not the case.


Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: Linus Torvalds <torvalds@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Friday, January 11, 2008 - 11:06 pm

Disagree.  Going as much as twice as fast for many history operations
for 10% added space sounds like a clear win to me.  We can easily agree
to disagree though - making it a disabled by default config option

What do you mean?  1 second waiting is far better than 2 seconds
waiting.  And the mmap optimizations have not even begun yet - that

This is an awesome boost!  Everything feels snappier already :)

maia:~/src/perl.clean$ time git-log | LANG=C wc
 288927  894027 8860916

real    0m0.839s
user    0m0.824s
sys     0m0.144s
maia:~/src/perl.clean$ cd ../perl.clean.loose/
maia:~/src/perl.clean.loose$ time git-log | LANG=C wc
 288927  894027 8860916

real    0m0.515s
user    0m0.504s
sys     0m0.136s

maia:~/src/perl.clean.loose$ du -sk .git/objects/pack/
113484  .git/objects/pack/
maia:~/src/perl.clean.loose$ cd -
/home/samv/src/perl.clean
maia:~/src/perl.clean$ du -sk .git/objects/pack/
107040  .git/objects/pack/
maia:~/src/perl.clean$
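
As a rough check on the transcript above, the relative changes can be worked out directly from the "real" times and the du figures. This is only a sketch of the arithmetic: the constant and function names are made up for illustration, and a single run per repository says nothing about variance.

```python
# Figures copied from the transcript above; the names are illustrative only.
CLEAN_REAL_S = 0.839   # git-log wall time in perl.clean
LOOSE_REAL_S = 0.515   # git-log wall time in perl.clean.loose
CLEAN_PACK_KB = 107040  # du -sk .git/objects/pack/ in perl.clean
LOOSE_PACK_KB = 113484  # du -sk .git/objects/pack/ in perl.clean.loose


def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


time_change = percent_change(CLEAN_REAL_S, LOOSE_REAL_S)
size_change = percent_change(CLEAN_PACK_KB, LOOSE_PACK_KB)

print(f"wall time: {time_change:+.1f}%")   # wall time: -38.6%
print(f"pack size: {size_change:+.1f}%")   # pack size: +6.0%
```

So for this one repository and this one command, the trade is roughly a 39% shorter run for a 6% larger pack directory.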

Want me to try this on kde.git?

Sam.
-
To: Sam Vilain <sam@...>
Cc: Linus Torvalds <torvalds@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Saturday, January 12, 2008 - 12:09 pm

If you can come up with a real life scenario, and not simply a simple test 
having little relevance to typical usage, that shows a clear reduction 
in execution time which is human perceptible, then I'll agree with you.  
But doing a full history log taking one second instead of two isn't a 
good enough argument to me for making the repository many megabytes 
larger.  Again if it was 'git blame' using 5 seconds instead of 10 then 
I would agree that this is a clear win, even if this is also a 50% 
execution time reduction.  But human perception is way more important 
when it is 10 secs down to 5 compared to 2 secs down to 1.

This proposed change isn't free, because you have to introduce a 
regression in one place in order to make a gain somewhere else. The pack 
v4 format that I developed with Shawn, though, was showing _both_ a 
speed gain and a repository size reduction, hence there is no regression 

I suppose we do.


Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: Sam Vilain <sam@...>, Linus Torvalds <torvalds@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Saturday, January 12, 2008 - 12:44 pm

Hi,


I have to agree with Nicolas.  A full history log is such a rare occasion 
that it is not worth optimising for.

When I call "git log", it typically shows me the first commit 
_instantaneously_, which is plenty fast enough for me, especially given 
that I quit it right away or after a few pages more often than not.

Ciao,
Dscho

-
To: Sam Vilain <sam@...>
Cc: Linus Torvalds <torvalds@...>, Nicolas Pitre <nico@...>, Pierre Habouzit <madcoder@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Friday, January 11, 2008 - 3:05 am

only about 900 revisions old.
-
To: Linus Torvalds <torvalds@...>
Cc: Pierre Habouzit <madcoder@...>, Sam Vilain <sam@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Thursday, January 10, 2008 - 5:30 pm

Obviously.

This was a really crude test, and my initial goal was to quickly dismiss 
Pierre's assertion.  Turns out that he wasn't that wrong after all, and 
if a significant increase in access speed by avoiding zlib for 82% of 
object accesses can also be demonstrated for the kernel, then we have an 
opportunity for some optimization tradeoff with no backward 

Right.  But again this is not worth pursuing if a significant speed 
increase in repo access is not demonstrated at least with the kernel.


Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: Linus Torvalds <torvalds@...>, Sam Vilain <sam@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>, Marco Costalba <mcostalba@...>, Junio C Hamano <gitster@...>
Date: Friday, January 11, 2008 - 4:57 am

Well, that wasn't a random assertion: I made it because I assumed that
a delta is usually less than a few hundred bytes, and as compression is
applied only to the delta without context, you end up packing 500 bytes

  Well, one could use the fact that deltas are not packed to avoid
copying them around, and that will _necessarily_ become a gain (you can
read them where they have been mmapped, for instance). The numbers that
were given for git annotate use a compression level of `0', which doesn't
exploit that fact, and I wouldn't be surprised to see a noticeable gain
if one does that.
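
To make the "read them where they have been mmapped" point concrete, here is a minimal sketch, not git code. The file layout (a raw payload followed by its deflated form) and all names are assumptions made up for the illustration. A raw payload can be handed out as a window onto the mapping, whereas a deflated one has to be inflated into a newly allocated buffer first.

```python
import mmap
import os
import tempfile
import zlib

# Stand-in for a small delta; the content is made up.
payload = b"copy 40 bytes from base, then insert 'fix'\n"
deflated = zlib.compress(payload)

# Hypothetical file layout: the raw payload followed by its deflated form.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload + deflated)
    path = tmp.name

try:
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped, \
         memoryview(mapped) as view, \
         view[:len(payload)] as raw_window, \
         view[len(payload):] as deflated_window:
        # Stored raw: the window points straight into the mapping, no copy.
        raw_matches = raw_window == payload
        # Stored deflated: inflating always materialises a fresh buffer.
        inflated = zlib.decompress(deflated_window)
finally:
    os.unlink(path)

print(raw_matches, inflated == payload)
```

Whether skipping that allocation and inflate step is measurable in practice is exactly what would need benchmarking.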

  And actually, maybe it's not the deltas that we should not pack, but
objects under a certain size (say 512 bytes, e.g.?), whichever type they
have, and have the code exploit that fact for real, and avoid copies.
With this criterion, I expect the repository not to grow a lot larger
(I'd say quite a bit less than the 10% you had, as even in the kernel
there _are_ some larger deltas, and we definitely lose space for them;
I'd expect less than a 5% size variation), and I _think_ it's worth
investigating. At least I expect commands (like blame or even log[0])
that go through a lot of small objects to see a visible 10 to 20% speed
increase (backed up by some experience I have in avoiding copies
in not-so-similar cases though, so it may be less, and I'll stand
corrected -- and disappointed, a bit).

  [0] If I'm correct commit messages are "objects" on their own, and I
      don't expect them to be very often over 512 octets.
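
The size criterion could be sketched roughly as follows. This is an illustration only and not git's pack format: the 512-byte threshold is just the guess from above, and the (flag, bytes) pair is a hypothetical stand-in for however an object would be marked as stored raw.

```python
import zlib

MIN_COMPRESS_SIZE = 512  # hypothetical threshold, from the "say 512 bytes" guess


def encode(payload: bytes, min_size: int = MIN_COMPRESS_SIZE) -> tuple[bool, bytes]:
    """Return (is_compressed, stored_bytes); payloads under min_size stay raw."""
    if len(payload) < min_size:
        return False, payload
    return True, zlib.compress(payload, 6)


def decode(is_compressed: bool, stored: bytes) -> bytes:
    """Inverse of encode(): inflate only when the flag says so."""
    return zlib.decompress(stored) if is_compressed else stored


small = b"Fix typo in README\n"       # stand-in for a short commit message
large = b"static int x;\n" * 200      # stand-in for a larger blob

for obj in (small, large):
    flag, stored = encode(obj)
    assert decode(flag, stored) == obj
    print(len(obj), "->", len(stored), "compressed" if flag else "raw")
```

A zlib stream carries a fixed header and checksum, so for very small inputs the raw form costs little or no extra space while skipping the inflate call entirely on read.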
-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org