Re: git and mtime

From: Roger Leigh
Subject: git and mtime
Date: Wednesday, November 19, 2008 - 4:37 am

Hi folks,

I'm using git to store some generated files, as well as their sources.
(This is in the context of Debian package development, where entire
upstream release tarballs are injected into an upstream branch, with
Debian releases merging the upstream branch, and adding the Debian
packaging files.)

The upstream release tarballs contain files such as
- yacc/lex code, and the corresponding generated sources
- Docbook/XML code, and corresponding HTML/PDF documentation

These are provided by upstream so that end users don't need these tools
installed (particularly docbook, since the toolchain is so flaky on
different systems).  However, the fact that git isn't storing the
mtime of the files confuses make, so it then tries to regenerate these
(already up-to-date) files, and fails in the process since the tools
aren't available.

Would it be possible for git to store the mtime of files in the tree?

This would make it possible to do this type of work in git, since it's
currently a bit random as to whether it works or not.  This only
started when I upgraded to an amd64 architecture from powerpc32;
I guess it's maybe using high-resolution timestamps.
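
A minimal sketch of the staleness rule that trips make here, assuming
make simply compares modification times: a target is rebuilt when it is
missing or older than any prerequisite. The file names (parser.y,
parser.c) and the helper names are hypothetical; the timestamps are set
by hand to imitate a checkout that happens to write the generated file
first.

```python
import os
import tempfile


def is_stale(target, prerequisites):
    """Approximate make's rule: rebuild if the target is missing or older
    than any prerequisite (compared at nanosecond resolution)."""
    if not os.path.exists(target):
        return True
    target_mtime = os.stat(target).st_mtime_ns
    return any(os.stat(p).st_mtime_ns > target_mtime for p in prerequisites)


def demo():
    """Return (stale_after_checkout, stale_as_shipped) for a toy parser pair."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "parser.y")     # hypothetical file names
        generated = os.path.join(tmp, "parser.c")
        for path in (source, generated):
            with open(path, "w") as handle:
                handle.write("placeholder\n")

        # Checkout order is arbitrary: here the generated file is written
        # first, so the source ends up newer than its own output.
        os.utime(generated, (1000, 1000))
        os.utime(source, (1010, 1010))
        after_checkout = is_stale(generated, [source])

        # As shipped in the tarball: the generated file is the newer one.
        os.utime(generated, (2000, 2000))
        as_shipped = is_stale(generated, [source])
        return after_checkout, as_shipped
```

With coarse one-second timestamps both files often receive the same
mtime on checkout and nothing looks stale; with sub-second timestamps
the write order starts to matter, which would fit the "a bit random"
behaviour described above.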


Thanks,
Roger


P.S. The repo I'm working on here is at
     git://git.debian.org/git/collab-maint/gutenprint.git

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
--

From: Matthias Kestenholz
Date: Wednesday, November 19, 2008 - 5:22 am

Hi,


This subject comes up from time to time, but the answer always
stays the same: No. The trees are purely defined by their content, and
that's by design.

If you do not want to regenerate files that are already up-to-date,
you need multiple checkouts of the same repository.



Thanks,
Matthias
--

From: Andreas Ericsson
Date: Thursday, November 20, 2008 - 1:38 am

Or a make-rule that touches the files you know are up to date. Since you
control the build environment, that's probably the simplest solution.
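
A rough sketch of that "touch the files you know are up to date" step,
written as a small helper rather than a make rule. The function name,
the optional `stamp` parameter and the file names are illustrative
assumptions, not part of any existing tool.

```python
import os
import tempfile
import time


def mark_up_to_date(generated_files, stamp=None):
    """Give every pregenerated file a fresh mtime so make sees it as current."""
    stamp = time.time() if stamp is None else stamp
    for path in generated_files:
        os.utime(path, (stamp, stamp))


def demo():
    """Return True if the generated file ends up newer than its source."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "manual.xml")      # hypothetical names
        generated = os.path.join(tmp, "manual.html")
        for path in (source, generated):
            open(path, "w").close()
        os.utime(source, (2000, 2000))
        os.utime(generated, (1000, 1000))   # looks stale after a checkout
        mark_up_to_date([generated], stamp=3000)
        return os.stat(generated).st_mtime > os.stat(source).st_mtime
```

Run once after checkout and before make, this keeps the dependency
rules themselves untouched.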

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
--

From: Roger Leigh
Date: Thursday, November 20, 2008 - 4:20 am

This is the approach I'm currently taking, since it's simple and
doesn't require any tool changes.  Ideally, I'd like to avoid
such hackiness, though.

I understand all the arguments I've seen in favour of not using the
mtime of the files when checking out.  They make sense.  However,
in some situations (such as this), they do not--git is breaking
something that was previously working.  In my case, I'm
injecting *release tarballs* into git, and the timestamps on the
files really do matter.  Regarding issues with branching and branch
switching, I always do builds from clean in this case.

If an option was added to git-checkout to restore mtimes, it need
not be the default, but git could record them on commit and then
restore them if asked /explicitly/.

For this, and some other uses I have in mind for git, it would be
great if git could store some more components of the inode
metadata in the tree, such as:
- mtime
- user
- group
- full permissions
- and also allow storage of the full range of file types (i.e.
  block, character, pipe, etc.)

This would allow git to be used as the basis for a complete
functional versioned filesystem (which I'd like to use for my
lightweight virtualisation tool, schroot, which currently
uses LVM snapshots for this purpose).


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
--

From: Andreas Ericsson
Date: Thursday, November 20, 2008 - 5:48 am

You can. The way to ask explicitly right now is to write hooks
that implement the functionality you want. It's not as easy as
setting a config value, but since you'd have to write the patch
to do that anyways (and it's likely it will get dropped), you'd
be better off writing some hooks and submitting them as contrib.

I believe someone else has done some work along the lines of
turning git into a complete-with-metadata backup system before.
Google might prove beneficial.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
--

From: Andreas Ericsson
Date: Thursday, November 20, 2008 - 6:12 am

Although now that I come to think of it, storing "user" and
"group" made it near-enough totally useless for anything a
user had created as the repos hardly ever could be shared.

I'll say it again: hooks can be written to handle this.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
--

From: Johannes Schindelin
Date: Wednesday, November 19, 2008 - 5:31 am

Hi,


No, since this would wreck people's workflows:

	- compile in branch "master"
	- switch to branch "topic"
	- compile
	- switch back to branch "master"

Now you _want_ files in "master" that were changed in "topic" to be 
recompiled.

This is a quite common case.

However, nothing hinders you having your own ".gitmtimes" in the tree, and 
a script people can use as a hook, which applies the mtimes to the files.
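
One way such a manifest and hook script could look, as a sketch only.
The ".gitmtimes" name comes from the suggestion above; the line format
("mtime<TAB>path") and both function names are assumptions made for
illustration. `apply_mtimes` is the part one might call from a
post-checkout hook.

```python
import os
import tempfile

MANIFEST = ".gitmtimes"  # name from the suggestion above; the format is assumed


def save_mtimes(root, paths):
    """Write one 'mtime<TAB>path' line per repo-relative path."""
    with open(os.path.join(root, MANIFEST), "w") as out:
        for rel in sorted(paths):
            mtime = int(os.stat(os.path.join(root, rel)).st_mtime)
            out.write(f"{mtime}\t{rel}\n")


def apply_mtimes(root):
    """Restore the recorded mtimes; returns the paths that were updated."""
    restored = []
    with open(os.path.join(root, MANIFEST)) as manifest:
        for line in manifest:
            stamp, rel = line.rstrip("\n").split("\t", 1)
            target = os.path.join(root, rel)
            if os.path.exists(target):      # skip files absent from this checkout
                os.utime(target, (int(stamp), int(stamp)))
                restored.append(rel)
    return restored


def demo():
    """Record an mtime, clobber it as a checkout would, then restore it."""
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "parser.c")   # hypothetical file
        open(path, "w").close()
        os.utime(path, (1_000_000, 1_000_000))
        save_mtimes(root, ["parser.c"])
        os.utime(path, (2_000_000, 2_000_000))  # simulate a fresh checkout
        apply_mtimes(root)
        return int(os.stat(path).st_mtime)
```

Because the manifest is ordinary tracked content, it travels with the
repository, and restoring stays opt-in for whoever installs the hook.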

Ciao,
Dscho

--

From: Jakub Narebski
Date: Wednesday, November 19, 2008 - 6:29 am

I don't think it would be done as a core change at all, or at least
not soon.

You can use Metastore, or some custom clean/smudge gitattribute
filters with something like Metastore (or etckeeper) to store extra
metadata about files in your tree.

See http://git.or.cz/gitwiki/InterfacesFrontendsAndTools

-- 
Jakub Narebski
Poland
ShadeHawk on #git
--

From: Arafangion
Date: Wednesday, November 19, 2008 - 5:37 am

On Wed, 2008-11-19 at 11:37 +0000, Roger Leigh wrote:

Unless I'm mistaken, I was under the impression that the reason why git
doesn't, and shouldn't do this is _because_ it confuses make.

Suppose you've got two branches, and you check out the other branch,
resulting in changes in 3 files.  Should git go and modify the mtime for
every single file, and remove any file that isn't part of the repo (such
as generated object files)?

If it modifies the dates on every file, but doesn't remove the generated
object files, how does make handle that?  It will likely regenerate some
of the object files, but not all of them.

If it doesn't, but touches the files that changed, and the dates are now
older than the corresponding object files, make would fail to recompile
the project properly!

The only way this could work is if you never switch branches, which is
quite limiting for git, and never check out an older revision, which is
quite limiting for revision control systems in general.

You should probably fix your build script, or add a hook script that
sets the dates on the files in question manually, but the former
solution would be much better.

--

From: Matthieu Moy
Date: Wednesday, November 19, 2008 - 7:54 am

ccache should help:

http://ccache.samba.org/

-- 
Matthieu
--

From: Andreas Ericsson
Date: Thursday, November 20, 2008 - 1:39 am

Not for docbook/flex/yacc stuff, which is what was causing trouble.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
--

From: Johannes Schindelin
Date: Thursday, November 20, 2008 - 3:34 am

Hi,


Only if you _allow_ the problem that makes ccache necessary.  Which we 
don't.

Ciao,
Dscho

P.S.: reminds me -- once again -- of the complicator's glove.
--

From: Matthieu Moy
Date: Thursday, November 20, 2008 - 3:53 am

You do, for some definition of "problem".

make
git checkout whatever
make
git checkout where-you-were-before
make # <--- this one

The last make is correct with plain git and plain make, but it can be
slow. With a cache in your build system, it can just reuse the objects
created during the first "make". It doesn't change correctness, only
performance.
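
A toy illustration of that point, not ccache's actual algorithm: a
cache keyed by a hash of the input means the third "make" finds the
result of the first one, whatever the timestamps say. The class name
and the stand-in compile function are made up for the example.

```python
import hashlib


class BuildCache:
    """Toy content-addressed cache: results are keyed by a hash of the input."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn
        self.store = {}
        self.misses = 0     # number of times the real compile step ran

    def build(self, source_text):
        key = hashlib.sha256(source_text.encode()).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.compile_fn(source_text)
        return self.store[key]


def demo():
    """Build master, then topic, then master again; count real compiles."""
    cache = BuildCache(lambda text: f"object({len(text)} bytes)")
    master = "int main(void) { return 0; }"
    topic = "int main(void) { return 1; }"
    for checked_out in (master, topic, master):
        cache.build(checked_out)
    return cache.misses
```

Only two of the three builds do real work; correctness is unchanged
because the key depends on content alone.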

-- 
Matthieu
--

From: Christian MICHON
Date: Wednesday, November 19, 2008 - 9:18 am

Besides the obvious answer (this comes back often as a request), it is
possible in theory to create a shell script which, for each file
present in the sandbox in the current branch, would find the time of
the last commit on that file (quite an expensive operation) and apply
it as the mtime.
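
Sketched in Python rather than shell, the idea might look like this.
It assumes `git log -1 --format=%ct -- <path>` as the way to get the
committer timestamp of the newest commit touching a file; the function
names and the injectable `lookup` parameter are illustrative. One git
process is started per file, which is where the cost comes from.

```python
import os
import subprocess
import tempfile


def last_commit_time(path, repo="."):
    """Committer timestamp (Unix seconds) of the newest commit touching path."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%ct", "--", path],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout.strip()
    return int(out) if out else None    # None: path has no history


def apply_commit_times(paths, repo=".", lookup=last_commit_time):
    """Set each file's mtime to its last commit time; one lookup per file."""
    applied = {}
    for rel in paths:
        stamp = lookup(rel, repo)
        if stamp is not None:
            os.utime(os.path.join(repo, rel), (stamp, stamp))
            applied[rel] = stamp
    return applied


def demo():
    """Exercise the logic without a repository by injecting a fake lookup."""
    with tempfile.TemporaryDirectory() as repo:
        target = os.path.join(repo, "parser.c")     # hypothetical file
        open(target, "w").close()
        apply_commit_times(["parser.c"], repo,
                           lookup=lambda rel, root: 1227096000)
        return int(os.stat(target).st_mtime)
```

Note that this yields commit times, not the mtimes recorded in an
upstream tarball, so files committed together all get the same stamp.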

I had a need for this once, then lost interest since using git as it
is is so much better than trying to mimic behaviour of old scm tools
and makefiles.

You should store mostly content of source files. You should do a make
in your first cloned repo at least once before committing anything to
the repo. That's what I did and I saved days...

-- 
Christian
--
http://detaolb.sourceforge.net/, a linux distribution for Qemu with Git inside !
--

From: Johannes Schindelin
Date: Thursday, November 20, 2008 - 3:35 am

Hi,


I had a need like this, too, and solved it by teaching the build process 
to fall back to generated files if the tool to generate them was not 
available.
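
The decision itself is small. A sketch, assuming the generator is
looked up on PATH; the function name and the returned labels are
invented for the example.

```python
import os
import shutil


def plan_build(tool, output):
    """Decide what to do for one generated file.

    Returns 'regenerate' if the tool is on PATH, 'use-pregenerated' if it
    is not but the committed output exists, and raises otherwise.
    """
    if shutil.which(tool) is not None:
        return "regenerate"
    if os.path.exists(output):
        return "use-pregenerated"
    raise FileNotFoundError(f"{output} is missing and {tool} is not installed")
```

For instance, `plan_build("docbook2man", "mymanpage.1")` would report
which of the two paths a given machine takes.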

Ciao,
Dscho

--

From: Roger Leigh
Date: Thursday, November 20, 2008 - 4:27 am

Surely this is only expensive because you're not already storing the
information in the tree; if it was there, it would be (relatively)
cheap?  You could even compare the old and new trees to see if you

Except in this case I'm storing the content of *tarballs* (along with
pristine-tar).  I'm committing exactly what's in the tarball with
no changes (this is a requirement).  I can't change the source prior
to commit.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
--

From: Andreas Ericsson
Date: Thursday, November 20, 2008 - 6:06 am

No, it's because git is *snapshot* based and doesn't care about anything
but contents. Storing filestate information in the tree would be a
backwards incompatible change that would require a major version change.

Caring about meta-data the way you mean it would mean that

  git add foo.c; git commit -m "kapooie"; touch foo.c; git status

would show "foo.c" as modified. How sane is that? Or should we introduce
a new concept for altered metadata only? "metafied"? So what do we do
when the next user whizzes along and wants support for full acl's? And
what do we do when Windows (or some other bizarre system) add some sort
of extension so we have to have different types of ACL support on both

We already do that by matching the SHA1 hash for the index entries.
Only content that is actually different between two branches is altered
upon checkout (which is why it's so damn fast when you're using topic
branches properly).
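
To make the comparison concrete, here is a small sketch. `blob_id`
follows git's blob naming (SHA-1 over "blob <size>\0" plus the
content); `paths_to_rewrite` is a simplified stand-in for what a
checkout has to touch, working on plain {path: id} dictionaries rather
than on a real index.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """SHA-1 that git assigns to a blob: hash of 'blob <size>\\0' + content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def paths_to_rewrite(current, target):
    """Paths whose blob id differs between two {path: blob_id} snapshots."""
    changed = {p for p in target if current.get(p) != target[p]}
    removed = {p for p in current if p not in target}
    return sorted(changed | removed)
```

Files whose id is the same on both sides are never written, so their
mtimes (and anything built from them) stay as they were.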

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
--

From: Roger Leigh
Date: Thursday, November 20, 2008 - 7:15 am

It's not strictly true that it's only caring about contents.  The
contents are of course in the blobs, but the tree is already
effectively storing inode data, since it's a directory of
filenames/subtrees, just one that only cares to store the
permissions part of the total inode data.

I understand that git stores the permissions tacked onto the hash;
would it be feasible to tack on the other bits as well?
If I understand correctly, it's binary encoded in the pack format,
and that would require updating the format to hold the additional

I've never come close to suggesting we do anything so insane.

What I am suggesting is that on add/commit, the inode metadata
be recorded in the tree (like we already store perms), so that
it can be (**optionally**) reused/restored on checkout.

Whether it's stored in the tree or not is a separate concern from
whether to *use* it or not.  For most situations, it won't be
useful, as has been made quite clear from all of the replies, and I
don't disagree with this.  However, for some, the ability to have
this information to hand to make use of would be invaluable.


There have been quite a few suggestions to look into using hooks,
and I'll investigate this.  However, I do have some concerns
about *where* I would store this "extended tree" data, since it
is implicitly tied to a single tree object, and I wouldn't
want to store it directly as content.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
--

From: Andreas Ericsson
Date: Thursday, November 20, 2008 - 7:50 am

No, that would break backwards compatibility with cross-repo

Then write a hook for it. You agree that for most users this will be
totally insane, and yet you request that it's added in a place where
everyone will have to pay the performance/diskspace penalty for it
but only a handful will get any benefits. That's patently absurd.
Especially since there are such easy workarounds that you can put in

Store it as a blob targeted by a lightweight tag named
"metadata.$sha1" and you'll have the easiest time in the world when
writing the hooks. Also, the tags won't be propagated by default,
which is a good thing since your timestamps, uids or whatever almost
certainly will not work well on other developers' repositories.

That's what I'd do anyways.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
--

From: Roger Leigh
Date: Thursday, November 20, 2008 - 8:19 am

The cost is tiny.  The extra space would be smaller than a single

And yet the fact that it won't propagate makes it totally useless:
all the other people using the repo won't get the extra metadata
that will prevent build failures.  Having the extra data locally
is nice, but not exactly what I'd call a solution.  The whole point
of what I want is to have it as an integral part of the repo.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
--

From: Kyle Moffett
Date: Thursday, November 20, 2008 - 8:33 am

Easiest way is typically something like this in the makefile:

docbook_version = $(shell docbook2man --version 2>/dev/null)
ifneq ($(docbook_version),)

# docbook is available: build the man page for real.
mymanpage.1:
	## Real docbook build rules here

else

# docbook is missing: accept a pregenerated man page, or fail loudly.
mymanpage.1:
	if [ -e $@ ]; then \
		echo "No 'docbook' installed, using pregenerated man pages" >&2 ; \
	else \
		echo "Pregenerated manpages are missing and no docbook found!" >&2 ; \
		exit 1 ; \
	fi

endif

Such stuff will take an order of magnitude less time than trying to
patch GIT to preserve metadata that most projects don't want
preserved.  You may also find it's easier to just comment out the
documentation build rules if you are always guaranteeing that the docs
have been compiled.

Cheers,
Kyle Moffett
--

From: Andreas Ericsson
Date: Thursday, November 20, 2008 - 8:37 am

Then make it signed tags and ship them along.

Or do this properly and simply put in your buildsystem that some
targets never need to be rebuilt. That's (by far) the simplest
solution.

On a sidenote, I fail to see how the pre-generated stuff can avoid
getting updated unless the sources for that stuff were also updated,
in which case either of the following is true:
a) You really do need to rebuild, because upstream fucked up.
b) The pre-generated stuff should *also* be checked out and get new
   timestamps.

Either way, to me it sounds like your buildsystem needs some love.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
--

From: Matthias Kestenholz
Date: Thursday, November 20, 2008 - 11:36 am

No, the cost is huge. The SHA-1 for the tree with _exactly the same
contents_ will be different, just because e.g. you applied a patch one
second earlier than I did, and that's completely insane. Git is purely
a content tracker, as has been said numerous times on this mailing
list, and that is for good reasons. If the tree entries change just
because some timestamps are different, the CPU time needed to
generate a diff will grow considerably.
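
A sketch of why the ids would diverge. `tree_id` hashes a flat list of
entries the way git hashes a tree object (SHA-1 over "tree <size>\0"
plus "<mode> <name>\0<20-byte id>" per entry). The `with_mtime` variant,
which appends a timestamp to every entry, is purely hypothetical; git
has no such field.

```python
import hashlib


def tree_id(entries, with_mtime=False):
    """Hash a flat list of (mode, name, blob_hex, mtime) entries like a git tree.

    With with_mtime=True an mtime field is appended to every entry. That
    variant is hypothetical and only illustrates the proposal.
    """
    body = b""
    for mode, name, blob_hex, mtime in sorted(entries, key=lambda e: e[1]):
        body += mode.encode() + b" " + name.encode() + b"\x00"
        body += bytes.fromhex(blob_hex)
        if with_mtime:
            body += b"%d" % mtime
    return hashlib.sha1(b"tree %d\x00" % len(body) + body).hexdigest()


def demo():
    """Compare two trees with equal content but timestamps one second apart."""
    blob = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"   # id of the empty blob
    mine = [("100644", "foo.c", blob, 1227200000)]
    yours = [("100644", "foo.c", blob, 1227200001)]
    same_today = tree_id(mine) == tree_id(yours)
    same_with_mtime = tree_id(mine, True) == tree_id(yours, True)
    return same_today, same_with_mtime
```

Today the two trees share one id, so comparing them is a single hash
comparison; with timestamps mixed in they would differ even though no
file content changed.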

Attempts to add additional information to the basic git objects have
failed several times, and yours will probably fail too since there are
numerous reasons why you do _not_ want a timestamp in the tree
_and_ there are several workarounds for your problem, which at


--

From: Randal L. Schwartz
Date: Thursday, November 20, 2008 - 6:11 am

>>>>> "Roger" == Roger Leigh <rleigh@codelibre.net> writes:

Roger> Except in this case I'm storing the content of *tarballs* (along with
Roger> pristine-tar).  I'm committing exactly what's in the tarball with
Roger> no changes (this is a requirement).  I can't change the source prior
Roger> to commit.

If you're not doing distributed source code development, why are you using
git?  It's hard to be angry at a screwdriver for not pounding in nails
properly.

Sounds like you want rsync or something.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
See http://methodsandmessages.vox.com/ for Smalltalk and Seaside discussion
--

From: Roger Leigh
Date: Thursday, November 20, 2008 - 6:40 am

Err, it *is* being used for distributed development... of Debian
packaging.  We track upstream releases on one branch, merge this
periodically onto the master branch containing the Debian packaging
infrastructure, and also have other bits such as a
continually-rebased patches branch to generate quilt patch series

I think not!  Perhaps if you read my original mail, you might
understand the reasoning behind this (whether you consider that
valid reasoning or not is another matter).


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
--

From: Daniel Barkalow
Date: Thursday, November 20, 2008 - 10:59 am

Can you store the tarballs in the repository, instead of the contents of 
the tarballs? The tarballs will contain the dates you want, and you can 
obviously get tar to set the timestamps the way you want. (Then you add a 
higher-level Makefile that knows how to unpack the tarball to a directory, 
maintaining the timestamps, patch anything you're changing, and run make 
in that directory.)
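
A sketch of the unpacking step, using Python's tarfile module in place
of a Makefile rule calling tar. Extraction restores each member's
recorded mtime. The tarball and member names are hypothetical, and the
demo builds its own tiny tarball so it can run anywhere.

```python
import io
import os
import tarfile
import tempfile


def unpack_preserving_mtime(tarball_path, dest):
    """Extract a tarball; tarfile restores each member's recorded mtime."""
    with tarfile.open(tarball_path) as archive:
        archive.extractall(dest)    # only do this with tarballs you trust


def demo():
    """Build a one-file tarball with a known mtime, unpack it, read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        tarball = os.path.join(tmp, "upstream.tar")     # hypothetical name
        payload = b"generated\n"
        info = tarfile.TarInfo("pkg/parser.c")
        info.size = len(payload)
        info.mtime = 1_200_000_000      # timestamp recorded by upstream
        with tarfile.open(tarball, "w") as archive:
            archive.addfile(info, io.BytesIO(payload))
        dest = os.path.join(tmp, "build")
        unpack_preserving_mtime(tarball, dest)
        return int(os.stat(os.path.join(dest, "pkg", "parser.c")).st_mtime)
```

Patching and running make would then happen inside the unpacked
directory, leaving the tarball itself as the tracked input.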

That is to say, from your perspective, the sources include the upstream 
distributed tarballs, but the individual files in upstream tarballs aren't 
source files for you, since you can't (by policy) modify them (within the 
pristine tarball). If you want to change the sources of the packaged 
project, you add a patch file to do it, rather than simply changing the 
source (which, as you say, you're required not to do).

Git really wants to store the inputs to your workflow, each of which might
change independently. That's why the files in your work tree have 
timestamps based on when they came to be in your work tree (get set to the 
current time whenever git puts different content there, and leaves them 
unchanged if their contents don't change when moving from commit to 
commit). The "sources" in your workflow are a different set of files from 
the sources in the project, and git really wants *your* repository to 
match *your* workflow and not the workflow of the upstream project, when 
you're acting as a packager rather than an upstream developer.

	-Daniel
*This .sig left intentionally blank*
--

From: Joey Hess
Date: Thursday, November 20, 2008 - 12:24 pm

Note that pristine-tar will work no matter what the mtimes or other file
metadata are; none of that affects generation of deltas or regeneration
of tarballs from them.

Also, the source you commit does not really have to be identical to
what's in the tarball. (Despite what it may say in the man page. ;-)
A larger delta will be generated if something is different.

So, three possible approaches:

1. Run make or whatever you need to do before running pristine-tar,
   and put up with a larger delta.

2. Before building, you could use pristine-tar to extract the original
   tarball, and then have a program examine that tarball, and reset the
   mtimes in your build tree to match the mtimes of files in it.
   (Or you could duplicate the info with metastore -m, which could be
   restored quicker.)

3. Store uncompressed tarballs in git, so that they will pack
   efficiently, and use pristine-gz to regenerate the pristine .tar.gz.
   Only mentioned because this could be more space efficient than option
   #1, if the pristine-tar deltas get too large.
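
Approach 2 could be sketched roughly as follows, assuming the original
tarball has already been regenerated (for example with pristine-tar)
and is available as a file. The function name and the
`strip_components` parameter (to drop a leading "pkg-1.0/" directory)
are assumptions made for the example.

```python
import os
import tarfile
import tempfile


def apply_tarball_mtimes(tarball_path, tree_root, strip_components=0):
    """Set mtimes of files under tree_root from the matching tarball members."""
    updated = []
    with tarfile.open(tarball_path) as archive:
        for member in archive:
            if not member.isfile():
                continue
            parts = member.name.split("/")[strip_components:]
            if not parts:
                continue
            target = os.path.join(tree_root, *parts)
            if os.path.isfile(target):      # ignore files not in the build tree
                os.utime(target, (member.mtime, member.mtime))
                updated.append("/".join(parts))
    return updated


def demo():
    """Reset one checked-out file to the mtime stored in a toy tarball."""
    with tempfile.TemporaryDirectory() as tmp:
        tarball = os.path.join(tmp, "upstream.tar")
        info = tarfile.TarInfo("pkg-1.0/doc/manual.html")  # hypothetical member
        info.mtime = 1_200_000_000
        with tarfile.open(tarball, "w") as archive:
            archive.addfile(info)                          # empty regular file
        tree = os.path.join(tmp, "tree")
        os.makedirs(os.path.join(tree, "doc"))
        checked_out = os.path.join(tree, "doc", "manual.html")
        open(checked_out, "w").close()
        apply_tarball_mtimes(tarball, tree, strip_components=1)
        return int(os.stat(checked_out).st_mtime)
```

Only the tarball headers are read, so nothing is extracted and the
build tree's contents are left alone.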

-- 
see shy jo
--

From: martin f krafft
Date: Thursday, November 20, 2008 - 6:21 am

I don't get it. Why are end users running make in the first place?
Why aren't those in the build-dependencies?

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

it is better to have loft and lost
than to never have loft at all.
                                                       -- groucho marx

spamtraps: madduck.bogus@madduck.net
--

From: Roger Leigh
Date: Thursday, November 20, 2008 - 6:35 am

By end user, I mean person downloading and building the sources.

They are optional build dependencies.  They are provided pre-built,
and won't be rebuilt unless they get outdated.  In the release
tarball, the timestamps are correct, ensuring this never happens.
When checking out with git, the timestamps are incorrect, and it
attempts to rebuild something that's *already built*.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
--

From: Johannes Schindelin
Date: Thursday, November 20, 2008 - 7:07 am

Hi,


I'll try just one more time.  Why don't you teach your build process to 
check if the generated files can be generated, and if not, fall back to 
the committed ones?

Ciao,
Dscho

--

From: Roger Leigh
Date: Thursday, November 20, 2008 - 7:22 am

Well, it's definitely not a good idea to try rebuilding when the tools
aren't available, and I'll update the Makefiles to only attempt a
rebuild when they are.  So yes, making the build a bit more
intelligent is definitely something to do.  However, this is really
a separate issue, since the repo dates back eight years, and I don't
want to break older stuff.  This will only fix things for the future.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
--

From: martin f krafft
Date: Thursday, November 20, 2008 - 6:59 am

I know you will hate me, but I think the solution here is to fix the
toolchain and make those build dependencies required.

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

"first get your facts; then you can distort them at your leisure."
                                                       -- mark twain

spamtraps: madduck.bogus@madduck.net
--

From: Samuel Tardieu
Date: Thursday, November 20, 2008 - 8:56 am

>>>>> "martin" == martin f krafft <madduck@madduck.net> writes:

martin> I know you will hate me, but I think the solution here is to
martin> fix the toolchain and make those build dependencies required.

I agree with martin here. Your planned solution of not rebuilding the
files if the tools are not present may lead to serious problems if the
user modifies the source files and happens not to have the tools
around.

Moreover, requiring the build dependencies would allow you to drop the
generated files from the repository and rebuild them in your packaging
(source or binary) process.

  Sam
-- 
Samuel Tardieu -- sam@rfc1149.net -- http://www.rfc1149.net/

--
