Git as a filesystem

Previous thread: [PATCH] Added a new placeholder '%cm' for full commit message by Michal Vitecek on Friday, September 21, 2007 - 6:14 am. (12 messages)

Next thread: [PATCH] Move the paragraph specifying where the .idx and .pack files should be by Matt Kraai on Friday, September 21, 2007 - 9:43 am. (1 message)
To: <git@...>
Date: Friday, September 21, 2007 - 6:51 am

Hi!

Is it possible/feasible to use git as a filesystem?
Like having git on top of ext3.

This way I could do a gitfs-gc and there is only one
pack file sitting on the disk which is a compressed
version of the whole system.
I am not interested in a version controlled filesystem,
only in the space saving aspects.

Thanks,

Peter
-

To: Peter Stahlir <peter.stahlir@...>
Cc: <git@...>
Date: Friday, September 21, 2007 - 7:11 am

Hi,

I haven't looked at it closely, but there is a GitFS:

http://git.or.cz/gitwiki/InterfacesFrontendsAndTools#head-f354b40618742b...

(I am pointing you to the Git Wiki, so that you can find more pointers
should you not be happy with this one.)

Ciao,
Dscho

-

To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Peter Stahlir <peter.stahlir@...>, <git@...>
Date: Friday, September 21, 2007 - 10:22 am

On Fri, Sep 21, 2007 at 12:11:41PM +0100, Johannes Schindelin wrote:

FYI, the last time I had a look at it, it did not compile with git 1.5.2.x.

thanks,
- VMiklos

To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: <git@...>
Date: Friday, September 21, 2007 - 7:41 am

Thank you.
This is what I was looking for. My motivation is to find out whether it is
possible to run a system, for example Debian, on top of gitfs,
and then keep a huge mirror on it, for example a complete 252GB
Debian mirror, as space efficiently as possible.

I wonder how big a deltified Debian mirror in one pack file would be. :)

Peter
-

To: Peter Stahlir <peter.stahlir@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, <git@...>
Date: Friday, September 21, 2007 - 9:22 am

It would be just as big as the non gitified storage on disk.

The space saving with git comes from efficient delta storage of
_versioned_ files, i.e. multiple nearly identical versions of the same
file where the stored delta is only the small difference between the
first full version and subsequent versions. Unless you plan on storing
many different Debian versions together, you won't benefit from any
delta at all. And since Debian packages are already compressed, git
won't be able to compress them further.
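
The point about already-compressed packages can be checked with a small sketch. It assumes zlib as a stand-in for the compression inside a .deb, and the word list and sizes are made up for illustration:

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
words = ["package", "version", "depends", "architecture", "section",
         "priority", "maintainer", "description"]

# Stand-in for uncompressed package contents (hypothetical data).
plain = " ".join(rng.choice(words) for _ in range(20000)).encode()

# Stand-in for a .deb: the same bytes after one pass through a compressor.
packed = zlib.compress(plain, 9)


def ratio(data: bytes) -> float:
    """Compressed size divided by input size (lower means more saving)."""
    return len(zlib.compress(data, 9)) / len(data)


print(f"plain text:         {ratio(plain):.3f}")
print(f"already compressed: {ratio(packed):.3f}")
```

The second ratio should come out at roughly 1, which is the behaviour described above: a second compression pass over compressed data has nothing left to remove.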

So don't waste your time.

Nicolas
-

To: Nicolas Pitre <nico@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Peter Stahlir <peter.stahlir@...>, <git@...>
Date: Friday, September 21, 2007 - 7:33 pm

On a similar note, has anybody experimented with using git to
store maildirs or news spools? I'd imagine the quoted portions of
most message threads could be delta-compressed quite efficiently.

--
Eric Wong
-

To: Eric Wong <normalperson@...>
Cc: Nicolas Pitre <nico@...>, Peter Stahlir <peter.stahlir@...>, <git@...>
Date: Friday, September 21, 2007 - 7:42 pm

Hi,

I store all my mail in a git repository. Works beautifully. Except that
the buffers on my laptop are constantly full :-( So a simple commit takes
some waiting.

Should be no issue on normal (desktop) machines.

Ciao,
Dscho

-

To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Nicolas Pitre <nico@...>, Peter Stahlir <peter.stahlir@...>, <git@...>
Date: Friday, September 21, 2007 - 10:06 pm

D'oh. I already have maildir performance problems on my laptop.

I wonder how well only having an index and no commits (no versioning),
and manual packing with pack-objects would work. Packing could be
optimized to order objects based on the Message-Id, References, and
In-Reply-To headers, too.
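
A rough sketch of the grouping idea, using Python's standard email parser. The function names are made up for illustration, and the sketch only computes an order; whether a packer would actually benefit from that order is not established here.

```python
from email import message_from_string


def thread_root(raw: str) -> str:
    """Best guess at the Message-Id that started the thread of `raw`."""
    msg = message_from_string(raw)
    refs = (msg.get("References") or "").split()
    if refs:
        return refs[0]  # by convention the first reference is the thread root
    parent = (msg.get("In-Reply-To") or "").strip()
    return parent or (msg.get("Message-Id") or "").strip()


def thread_order(messages):
    """Stable sort that places messages of the same thread next to each other."""
    return sorted(messages, key=thread_root)
```

Because the sort is stable, replies keep their original relative order within a thread, so a reply that quotes its parent ends up close to it.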

--
Eric Wong
-

To: Eric Wong <normalperson@...>
Cc: Nicolas Pitre <nico@...>, Peter Stahlir <peter.stahlir@...>, <git@...>
Date: Saturday, September 22, 2007 - 8:06 am

Hi,

Umm. Regular operation is not affected, since I (add and) commit only

The most efficient way would be to have a mailer backend accessing the
database, and then not have a working directory, methinks (especially with
these amounts of mail I am juggling ATM).

Time forbids working on this, though.

Ciao,
Dscho

-

To: Nicolas Pitre <nico@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, <git@...>
Date: Friday, September 21, 2007 - 9:35 am

The 252GB stem from the fact that there are more than 10 architectures.
I guess the /usr/share/doc of all architectures could be deltified (as could
be all files that are architecture-independent)

Right?
-

To: Peter Stahlir <peter.stahlir@...>
Cc: Nicolas Pitre <nico@...>, Johannes Schindelin <Johannes.Schindelin@...>, <git@...>
Date: Friday, September 21, 2007 - 11:46 am

I don't think so. Architecture-independent files are usually separated
out into separate packages (think of the -doc and -data packages) that
get architecture "all" and land in the Debian archive only once. So you
probably won't save too much there.

Chris
-

To: Peter Stahlir <peter.stahlir@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, <git@...>
Date: Friday, September 21, 2007 - 9:45 am

Indeed.

But how much does this represent, once compressed, compared to the
rest? I doubt it is significant enough to be worth the trouble.

Nicolas
-

To: Peter Stahlir <peter.stahlir@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, <git@...>
Date: Friday, September 21, 2007 - 8:53 am

Very, very close to 252 GB, since .deb files are already compressed.

If it's just the gzip compression you want, surely there must be real
filesystems that can do that.

--
Karl Hasselström, kha@treskal.com
www.treskal.com/kalle
-

To: Karl Hasselström <kha@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, <git@...>
Date: Friday, September 21, 2007 - 9:28 am

Yes, but if there were deb and tar support in git (to automatically unpack
archives and store the contents), together with the best available
binary diffs, I think the repository could be significantly smaller because
files common to all architectures could be deltified.

I did a quick check with 100MB of deb archives; the result was nearly 100MB,
as you said.
I also did a quick check with all .so files in my /usr/lib directory; they shrank
from 50MB to 20MB, which is the same as what tar + bz2 achieves.

But the thing is, I think there is a lot of redundancy in
a) a Debian mirror or
b) your disk at home.

Telling git to handle -for example- deb archives and storing
everything in a pack file would take advantage of redundancy across
_all_ files.
So the /usr/share/doc of all architectures could be compressed.

Right?
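
The simplest form of redundancy across files is exact duplication, which git's content addressing removes without any delta work. A minimal sketch follows; the paths and file contents are hypothetical, while the blob id format ("blob <size>\0" followed by the content, hashed with SHA-1) is the one git uses.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Object id git assigns to a blob with this content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# Hypothetical unpacked files from two architectures.
files = {
    "i386/usr/share/doc/foo/copyright": b"Copyright (C) 2007 Foo authors\n",
    "amd64/usr/share/doc/foo/copyright": b"Copyright (C) 2007 Foo authors\n",
    "i386/usr/bin/foo": b"\x7fELF i386 build",
    "amd64/usr/bin/foo": b"\x7fELF amd64 build",
}

store = {}
for path, data in files.items():
    store.setdefault(git_blob_id(data), data)  # identical content is stored once

print(len(files), "files,", len(store), "stored objects")
```

Only byte-identical files collapse this way. Files that are merely similar would still depend on delta compression finding them.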
-

To: Peter Stahlir <peter.stahlir@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Karl <kha@...>, <git@...>
Date: Friday, September 21, 2007 - 1:29 pm

You can unpack the contents of gzipped or bzipped files and deltify them, but
you cannot restore exactly the same gzip or bzip file from its
contents unless you use exactly the same version of the compressor that was
used to create the original file. So, if you put any .deb file in such
a system, you will get back a different .deb file (with a different SHA1).
So, aside from the high CPU and memory requirements, this system cannot work in
principle unless all users have exactly the same version of the compressor.
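
The effect is easy to reproduce. The sketch below uses two compression levels of the same library as a stand-in for two compressor versions, which is an assumption made for illustration, and the payload is made up:

```python
import gzip
import hashlib
import random

rng = random.Random(0)  # fixed seed so the payload is repeatable
words = ["control", "data", "debian", "binary", "md5sums", "postinst", "usr", "share"]
payload = " ".join(rng.choice(words) for _ in range(20000)).encode()

# Same input, same timestamp field, only the compressor settings differ.
fast = gzip.compress(payload, compresslevel=1, mtime=0)
best = gzip.compress(payload, compresslevel=9, mtime=0)

print("same contents:", gzip.decompress(fast) == gzip.decompress(best))
print("same file:    ", fast == best)
print(hashlib.sha1(fast).hexdigest())
print(hashlib.sha1(best).hexdigest())
```

Both files decompress to the same bytes, yet the files themselves and their SHA1s differ, so a signature made over the original archive would not verify against a re-compressed one.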

Dmitry
-

To: Dmitry Potapov <dpotapov@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Peter Stahlir <peter.stahlir@...>, Karl Hasselström <kha@...>, <git@...>
Date: Friday, September 21, 2007 - 7:56 pm

Was thinking the same - compression machinery, ordering of the files,
everything. It'd be a nightmare to ensure you get back the same .deb,
without a single different bit.

The Debian packaging toolchain could be reworked to use a more GIT-like
approach. Off the top of my head, at least:

- signing/validating the "tree" of the package rather than the
completed package could allow the savings in distribution you mention,
decouple the signing from the compression, and simplify things like
debdiff

- git or git-like strategies for source packages

cheers,

m
-

To: Martin Langhoff <martin.langhoff@...>
Cc: Dmitry Potapov <dpotapov@...>, Johannes Schindelin <Johannes.Schindelin@...>, Peter Stahlir <peter.stahlir@...>, Karl Hasselström <kha@...>, <git@...>
Date: Friday, September 21, 2007 - 11:09 pm

Nightmare indeed. I actually wrote a proof of concept for this idea for
gzip.

http://git.catalyst.net.nz/gw?p=git.git;a=shortlog;h=archive-blobs
(see also
http://planet.catalyst.net.nz/blog/2006/07/17/samv/xteddy_caught_consumi...)

I usually warn people that this undertaking is "slightly insane".

My implementation was designed to be called like "git-hash-object".
What it did was look at the input stream, and detect quickly whether it
looked like a gzip stream. If it was, it would decompress it and then
try to compress the first few blocks using different compression
libraries and settings to determine what settings were used. If it
could find the right settings for the first meg or so, then it would
bank on the rest being identical as well, record which compressor and
what settings were used and write the uncompressed object, as well as
the information needed to reconstruct the gzip header, to a new type of
object called an "archive" object. If the stream could not be
reproduced then it would save the raw stream instead. For something
like a Debian archive, it is very likely that all compressed streams
will be reproducible, because they will almost all be compressed using
the same implementation of gzip.
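
The probing step can be illustrated with a much-simplified sketch. It is not the implementation described above: it assumes Python's zlib as the only candidate compressor, works on zlib streams rather than gzip files, compares the whole stream instead of the first blocks, and the names `store` and `restore` are made up.

```python
import zlib
from typing import Tuple


def store(stream: bytes) -> Tuple[str, bytes, int]:
    """Return ("inflated", data, level) if some level reproduces `stream`
    byte for byte, otherwise ("raw", stream, -1)."""
    try:
        data = zlib.decompress(stream)
    except zlib.error:
        return ("raw", stream, -1)  # not a zlib stream at all
    for level in range(10):
        if zlib.compress(data, level) == stream:
            return ("inflated", data, level)
    return ("raw", stream, -1)  # could not reproduce, keep the original bytes


def restore(record: Tuple[str, bytes, int]) -> bytes:
    """Rebuild the exact original stream from a stored record."""
    kind, payload, level = record
    return zlib.compress(payload, level) if kind == "inflated" else payload
```

The raw fallback is what keeps the scheme safe: whatever goes in comes back bit for bit, and only streams that can be reproduced are stored uncompressed, where they become candidates for delta compression.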

For tar and .ar files, this can be slightly more deterministic of
course. It doesn't even need to be particularly savvy of what all the
fields are - just locate the files in the .tar, write out a tree, and
then write a TOC that lists tree entries and contains any extra data (ie
headers, etc).
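
For uncompressed tar archives, the split into file contents plus a table of contents can be sketched as follows. The record layout and the names `split_tar` and `join_tar` are assumptions for illustration, not the format used in the proof of concept.

```python
import hashlib
import io
import tarfile


def split_tar(raw: bytes):
    """Split an uncompressed tar into file contents and a TOC of everything else."""
    blobs = {}  # content hash -> file content (what would go into the tree)
    toc = []    # ordered entries: ("extra", bytes) or ("blob", content hash)
    pos = 0
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as tar:
        for member in tar:
            if not member.isreg():
                continue  # non-file entries stay inside the "extra" bytes
            start = member.offset_data
            end = start + member.size
            content = raw[start:end]
            key = hashlib.sha1(content).hexdigest()
            toc.append(("extra", raw[pos:start]))  # headers and padding before the data
            toc.append(("blob", key))
            blobs[key] = content
            pos = end
    toc.append(("extra", raw[pos:]))  # trailing padding and end-of-archive blocks
    return blobs, toc


def join_tar(blobs, toc) -> bytes:
    """Rebuild the archive byte for byte from the pieces."""
    return b"".join(blobs[v] if kind == "blob" else v for kind, v in toc)
```

Since the TOC keeps every byte that is not file content verbatim, the sketch never needs to interpret header fields, which matches the remark that the code need not be savvy about them.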

In hindsight, making a new object type was probably a mistake. If I
were to re-undertake this I would not go down that path, though I'd
certainly consider using tag objects for the extra data, and throwing
them in the tree like submodules. It would also be essential in a
"real" solution to bundle reference copies of the zlib and gzip
compressors (yes, their output streams differ with longer inputs and
even some short ones).

Sam.
-

To: Peter Stahlir <peter.stahlir@...>
Cc: <git@...>
Date: Friday, September 21, 2007 - 10:38 am

Yes, surely. Your idea suggests that you want any file to be
reconstructed on-the-fly whenever it's being requested. Isn't
there the danger of killing performance, the CPU being the
bottleneck? I imagine such a debian mirror has quite some

I doubt it. There sure is lots of redundancy within each file and
that's what compressed file systems are good for. But what you
talk about is redundancy across (unversioned) files, and I don't
feel there is a lot of it. Yes, I might have a few copies of the
file COPYING on my disk, and maybe some of my sources share a few
functions, but this won't save me tons of space. All my binaries,
libraries, MP3s, videos, config files, etc don't really have any
redundancy across file boundaries. And even if there is, finding
that redundancy is an O(whatever-but-not-n) operation that would
be rather slow.

I definitely see gitfs (or similar ideas) as potentially being
useful in some cases (maybe debian mirrors could be one), but not
for my disk at home, which I generally would prefer to be faster
than more compressed.

jlh
-

To: Peter Stahlir <peter.stahlir@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Karl Hasselström <kha@...>, <git@...>
Date: Friday, September 21, 2007 - 9:41 am

You're proposing to trade off lots of CPU time in fetching many files
from a pack and making the package file -- paid every time someone
requests a package -- for at most 250 GB of space (cf Amdahl's law).

How long are your users willing to wait in exchange for 250 GB of
saved space? How much CPU are you willing to spend for it? Compare
those to the cost of a 300 GB hard drive (roughly $65).

There's also the cost to make git support the package format, and to
maintain that code going forward. Those costs are also large.

Michael Poole
-
