Re: Using git as a general backup mechanism (was Re: Using GIT to store /etc)

To: <git@...>
Date: Sunday, December 10, 2006 - 9:40 am

I've recently become somewhat interested in the idea of using GIT to  
store the contents of various folders in /etc.  However after a bit  
of playing with this, I discovered that GIT doesn't actually preserve  
all permission bits since that would cause problems with the more  
traditional software development model.  I'm curious if anyone has  
done this before; and if so, how they went about handling the  
permissions and ownership issues.

I spent a little time looking over how GIT stores and compares  
permission bits; trying to figure out if it's possible to patch in a  
new configuration variable or two; say "preserve_all_perms" and  
"preserve_owner", or maybe even "save_acls".  It looks like standard  
permission preservation is fairly basic; you would just need to patch  
a few routines which alter the permissions read in from disk or  
compare them with ones from the database.  On the other hand, it  
would appear that preserving ownership or full POSIX ACLs might be a  
bit of a challenge.

Thanks for your insight and advice!

Cheers,
Kyle Moffett

-
To: Kyle Moffett <mrmacman_g4@...>
Cc: <git@...>
Date: Monday, December 11, 2006 - 11:45 pm

The first thing you'd want to do is correct the fact that the index 
doesn't keep full permissions. We decided long ago that we don't want to 
track more than 0100, but we're discarding the rest between the filesystem 
and the index, rather than between the index and the tree. (This is weird 
of us, since we keep gid and uid in the index, as changedness heuristics, 
but don't keep permissions; of course, we'd have to apply umask to the 
index when we check it out to sync what we expect to be there with what 
has actually been created.)
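
To make the "no more than 0100" point concrete, here is a minimal sketch of how a full mode collapses to what a tree records for a regular file, and how a umask shapes the mode at checkout. The function names are illustrative only and are not git internals.

```shell
#!/bin/sh
# Sketch only: tree_mode and apply_umask are made-up helper names.

# Collapse full permission bits to the mode a git tree records for a
# regular file: only the owner-execute bit (0100) is kept.
tree_mode() {
    if [ $(( 0$1 & 0100 )) -ne 0 ]; then
        echo 100755
    else
        echo 100644
    fi
}

# Apply a umask to a requested mode, as happens when a file is
# created at checkout time.
apply_umask() {
    printf '%04o\n' $(( 0$1 & ~0$2 ))
}
```

For example, `tree_mode 0640` prints `100644`, so a 0640 file in /etc would come back from a checkout with whatever `apply_umask 0666 <umask>` yields rather than its original mode.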

I think that would be the only change needed to the index and 
index/working directory connection, although it might be necessary to 
support longer values for uid/gid/etc, since they'd be important data now.

Note that git only stores content, not incidental information. But a lot 
of information which is incidental in a source tree is content in /etc. 
This implies that /etc and working/linux-2.6 are fundamentally different 
sorts of things, because different aspects of them are content.

I'd suggest a new object type for a directory with permissions, ACLs, and 
so forth. It should probably use symbolic owner and group, too. My guess 
is that you'll want to use "commit"s, the new object type, and "blob"s. 
Everything that uses trees would need to have a version that uses the new 
type. But I think that you generally want different behavior anyway, so 
that's not a major issue.

	-Daniel
*This .sig left intentionally blank*
-
To: Daniel Barkalow <barkalow@...>
Cc: <git@...>
Date: Tuesday, December 12, 2006 - 9:49 am

Hmm, ok.  It would seem to be a reasonable requirement that if you  
want to change any of the "preserve_*_attributes" config options you  
need to blow away and recreate your index, no?  I would probably  
change the underlying index format pretty completely and stick a new  

Ahh, I hadn't thought of it that way before but that makes a lot of  

Ok, seems straightforward enough.  One other thing that crossed my  
mind was figuring out how to handle hardlinks.  The simplest solution  
would be to add an extra layer of indirection between the "file  
inode" and the "file data".  Instead of your directory pointing to a  
"file-data" blob and "file-attributes" object, it would point to an  
"file-inode" object with embedded attribute data and a pointer to the  
file contents blob.
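
As background for the hardlink question: git has no notion of a hard link, so two linked names are stored as two entries that happen to share one blob. A rough way to see which names a backup would need to re-link is sketched below. It assumes GNU stat, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes GNU stat (-c). list_hardlinks is a made-up name.

# Print "inode path" for every regular file under $1 that has more than
# one link, sorted so that names sharing an inode end up adjacent.
list_hardlinks() {
    find "$1" -type f -links +1 -exec stat -c '%i %n' {} + | sort -n
}
```

Names that print with the same inode number are the ones an extra "file-inode" layer would have to tie together.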

I remember reading some discussions from the early days of GIT about  
how that was considered and discarded because the extra overhead  
wouldn't give any real tangible benefit.  On the other hand for  
something like /etc the added benefits of tracking extended  
attributes and hardlinks might outweigh the cost of a bunch of extra  
objects in the database.  A bit of care with the construction of the  
index file should make it sufficiently efficient for day-to-day usage.

If you're interested in some random musings about using GIT concepts  
to version whole filesystems (think checkpointing your disk drive and  
instantly restoring when you screw up), read on below, otherwise  
don't bother.

Cheers,
Kyle Moffett

<Random Tangential Off-the-Wall Thought Experiment>

NOTE: This probably belongs in its own thread but it's such a  
random, undeveloped, and off-the-wall concept that I threw it in here  
just for kicks.

Combining extensions like those described above with something like  
the Ext3 block-allocation, inode-management and journalling code to  
produce a "versioned filesystem".  With the exponential growth of  
storage density over the last several years we've gotten to the poi...
To: Kyle Moffett <mrmacman_g4@...>
Cc: <git@...>
Date: Wednesday, December 13, 2006 - 2:10 pm

You should be able to promote an insufficient-version index to a 
new-version index that needs to be refreshed for every entry. (And then 
update-index would take care of the necessary rewrite-everything in the 
normal way). But I suspect that the right thing is to require that the 
repository be created with a "commits-include-directories-not-trees" flag, 
and this means that you always use the extra-detailed index, and the 
options only affect what information is filtered out in transit between 
the directory object and the index. Having more information in the index 
is merely a potential waste of space, not a correctness issue (we have 
extra information for trees in the index now, remember); it just means 
that there are more things that will cause git to reread the file, rather 
than declaring it unchanged with a stat().

For that matter, it may be best for the directory objects to record what 
information in them is real, and keep the "what's content" mask in the 
index as well. If it changes over the history of a repository, you want to 

I was thinking this could be internal to the directory object, but you 
probably want to support hardlinks shared between dentries in different 
directory objects, so you're probably right that this makes sense. 

Alternatively, you could use a single "directory" object for the whole 
state (including subdirectories), making hardlinks out of the object 
clearly impossible, or you could use some scheme for sharing 
sub-"directory" objects that would imply that hardlinks are within an 
object (the hard part here is finding things when their locations aren't 
predictable by name).

	-Daniel
*This .sig left intentionally blank*
-
To: Daniel Barkalow <barkalow@...>
Cc: Kyle Moffett <mrmacman_g4@...>, <git@...>
Date: Thursday, December 14, 2006 - 1:06 am

So, I've been making little repositories for appropriately related
stuff.  For example, I have a repository for my ~/.bashrc,
~/.bash_profile, ~/.bash_completions/*, and such.

I recall Linus's post in the "VCS Comparison Table" thread, and after
thinking about it, I decided the best thing to do would be to have a
couple extra files tracked in the repository, alongside other data.

I use a backup shell script to copy things from my system to the
repository, and then I run getfacl on it all to write out all the
details to a 'facl' file in my repository.  Then I can make a commit.

Then there's a restore shell script to copy things back to my system,
and restore ownership and permissions with setfacl.

I store the backup and restore scripts in the repository.  Paths are
currently hard-coded.  I'm sure there's a more flexible way to do
this, though I'd need some means of representing the correspondence
between content in the repository and files in my filesystem.
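
A minimal sketch of such a backup/restore pair is shown below. It assumes the acl tools (getfacl, setfacl) are installed. The file name 'facl' comes from the description above, while the function names and the choice to skip .git are assumptions.

```shell
#!/bin/sh
# Sketch only: save_facl and restore_facl are made-up names.

# Record owner, group and ACL entries for everything in the repository
# directory $1 (skipping .git) into a file named 'facl'.
# File names containing newlines are not handled.
save_facl() {
    ( cd "$1" && find . -path ./.git -prune -o -print | getfacl - > facl )
}

# Reapply what was recorded. Restoring ownership needs root.
restore_facl() {
    ( cd "$1" && setfacl --restore=facl )
}
```

The 'facl' file is plain text, so a permission change shows up as an ordinary diff in the next commit.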




-- 
epistemological humility
  Chris Riddoch
-
To: <git@...>
Date: Tuesday, December 12, 2006 - 11:53 am

I wonder if git's skill at managing content is the answer?  Rather than mess 
around with git's internals, the index, or the object database; how about 
simply having a pre-commit script that writes out a file that looks like:

-rw-r--r--  andyp andyp CHANGES
-rw-r--r--  andyp andyp COPYING
-rw-rw-r--  andyp andyp CREDITS
-rw-r--r--  andyp andyp Configure
-rw-rw-r--  andyp andyp Makefile
-rw-r--r--  andyp andyp README

If /that/ file were stored in the repository and you had a script that could 
read that file and apply the permissions after a checkout you'd have what you 
want.

If the permissions of a file changed but the content didn't, then 
this ".gitpermissions" file would have changed content but the file itself 
would remain the same.  If the content changed but not the permissions 
then ".gitpermissions" would be untouched.

Assuming that you're allowed to mess with the index in pre-commit (I haven't 
checked), one half of it can be automatic.  I suppose you could also plead 
for a post-checkout hook to apply those permissions and the whole lot would 
be transparent.
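
The two halves could be sketched roughly as follows. This assumes GNU stat, and it records octal modes rather than the symbolic form shown above so that they can be fed straight back to chmod (GNU stat's `%A` would give the symbolic form). The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes GNU stat. File names containing newlines are
# not handled. save_perms and restore_perms are made-up names.

# Print "octal-mode owner group path" for each path given as an argument.
save_perms() {
    stat -c '%a %U %G %n' -- "$@"
}

# Read lines in that format from stdin and reapply the mode bits.
# Ownership is only restored when running as root.
restore_perms() {
    while read -r mode owner group path; do
        chmod "$mode" -- "$path"
        if [ "$(id -u)" -eq 0 ]; then
            chown "$owner:$group" -- "$path"
        fi
    done
}
```

A pre-commit script might run something like `save_perms $(git ls-files) > .gitpermissions` (which breaks on names with whitespace), and `restore_perms < .gitpermissions` would be run by hand after a checkout. Whether a file staged from inside the hook ends up in the commit is left unchecked here, as in the message.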



Andy
-- 
Dr Andy Parkins, M Eng (hons), MIEE
andyparkins@gmail.com
-
To: <git@...>
Date: Tuesday, December 12, 2006 - 6:49 pm

This discussion reminds me of a use of git I've had in the back of my 
head to try out for a while. Right now I'm doing my local snapshot 
backups using the rsync-with-hard-links scheme 
(http://www.mikerubel.org/computers/rsync_snapshots/ if you're not 
familiar with it). This is nice in that the contents of files that don't 
change are only stored once on the backup disk. But it is less than 
optimal in that a file that changes even a little bit is stored from 
scratch.

What would be great for this would be to store each day's backup as a 
git revision; with a periodic repack, this would be much more 
space-efficient than the rsync hard links.

The problem is that while that would give me a very efficient backup 
scheme, the repository would still grow over time. In rsync land, I 
solve the disk space issue by keeping two weeks' worth of daily 
snapshots, then six months' worth of weekly snapshots, then two years' 
worth of monthly snapshots; files that change daily have a constant 
number of revisions stored in my backups, and older files drop off the 
backup disk as they age.
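
For reference, that retention rule can be written down as a small predicate. The cut-offs in days (14, 183, 730) and the choice of Sunday and the 1st of the month as the "weekly" and "monthly" snapshots are assumptions made for this sketch, not part of the scheme described above.

```shell
#!/bin/sh
# Sketch only: keep_snapshot is a made-up name.
# $1 = age of the snapshot in days
# $2 = ISO weekday it was taken on (1..7, as printed by `date +%u`)
# $3 = day of the month it was taken on
# Returns 0 if the snapshot should be kept, 1 if it may be dropped.
keep_snapshot() {
    age=$1 dow=$2 dom=$3
    # Two weeks of daily snapshots.
    if [ "$age" -lt 14 ]; then return 0; fi
    # About six months of weekly (Sunday) snapshots.
    if [ "$age" -lt 183 ] && [ "$dow" -eq 7 ]; then return 0; fi
    # About two years of monthly (1st of the month) snapshots.
    if [ "$age" -lt 730 ] && [ "$dom" -eq 1 ]; then return 0; fi
    return 1
}
```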

Given that there's no way (or is there?) to delete revisions from the 
*beginning* of a git revision history, right now it seems like the only 
approach that comes close is to give up on the "daily then weekly then 
monthly" thing -- probably fine given the space savings of delta 
compression -- and periodically make shallow clones of the backup 
repository that fetch all but the first N revisions; once a shallow 
clone is made, the original gets deleted and the clone is the new backup 
repo.

But it would sure be more efficient to be able to "shallow-ize" an 
existing repository. That would be useful for things other than backups, 
too, e.g. the recent request for some way to track just the current 
version of the kernel code rather than its revision history. If there 
were a shallowize command, you could do something like "git pull; git 
shallowize --depth 1" to track the latest revision without kee...
To: Steven Grimm <koreth@...>
Cc: <git@...>
Date: Tuesday, December 12, 2006 - 7:43 pm

Why not use N independent branches?  I'd illustrate only with
two levels below, but you could:

 (0) make a full tree snapshot.  Store the commit in 'daily'
     branch as its tip.

 (1) A new day comes.  Create an empty branch 'daily' if you
     do not already have one.  Make a full tree snapshot, and
     create a parentless commit for the day if the 'daily'
     branch did not exist, or make it a child of the 'daily'
     commit from the previous day if the branch existed.

 (2) End of week comes.  Create an empty branch 'weekly' if you
     do not already have one.  Make a full tree snapshot, and
     create a parentless commit for the week if the 'weekly'
     branch did not exist, or make it a child of the 'weekly'
     commit from the last week.  Discard 'lastweek' branch if
     you have one, and rename 'daily' branch to 'lastweek'.

At the end of month, you can rename 'weekly' to 'lastmonth'; if
you discard previous 'lastmonth' at this point, you essentially
made files older than two months drop off the backup disk.  You
can add more hierarchy with longer period to extend the scheme
ad infinitum.
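
The two-level version of this could be sketched with plumbing commands as below. The branch names follow the steps above, the function names are made up, and the script assumes it runs inside the backup repository with a committer identity configured.

```shell
#!/bin/sh
# Sketch only: snapshot_to and rotate_week are made-up names.

# Commit the current working tree as the new tip of branch $1. The
# commit is parentless when the branch does not exist yet.
snapshot_to() {
    branch=$1
    git add -A . || return 1
    tree=$(git write-tree) || return 1
    msg="snapshot $(date +%Y-%m-%d)"
    if parent=$(git rev-parse -q --verify "refs/heads/$branch"); then
        commit=$(echo "$msg" | git commit-tree "$tree" -p "$parent") || return 1
    else
        commit=$(echo "$msg" | git commit-tree "$tree") || return 1
    fi
    git update-ref "refs/heads/$branch" "$commit"
}

# End of week: extend 'weekly', then let 'daily' become 'lastweek'.
# -M overwrites an existing 'lastweek', which discards the older week.
rotate_week() {
    snapshot_to weekly || return 1
    git branch -M daily lastweek
}
```

Running `snapshot_to daily` once a day and `rotate_week` once a week gives steps (1) and (2). The objects of a discarded 'lastweek' only leave the disk after a later prune and repack.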

-
To: Junio C Hamano <junkio@...>
Cc: <git@...>
Date: Thursday, December 14, 2006 - 7:33 pm

That sounds like it'd work, but doesn't it imply that the history of a 
given file in the backups is not continuous? That is, an old copy of a 
file on the "weekly" branch doesn't have any kind of ancestor 
relationship with the same file on the "daily" branch? While that's 
obviously no different than the current git-less situation where there's 
no notion of ancestry at all, it'd be neat if this backup scheme could 
actually track long-term changes to individual files.

I wonder if rebasing can get me what I want. Something like:

(1) Make a new branch from the latest daily. Commit a full tree
    snapshot to the new branch. (Each branch has exactly one commit.)

(2) To expire a daily backup, rebase the second-oldest daily branch,
    which will initially be a child of the oldest daily branch, under
    the latest weekly branch instead. Delete the oldest daily branch.
    I believe the right commands here would be:

    git-rebase -s recursive -s ours --onto latest-weekly \
               oldest-daily second-oldest-daily
    git-branch -D oldest-daily

    (Not sure about the double "-s", but I want it to detect renames
    where possible and never flag any conflicts.)

(3) At the end of the week, instead of expiring the oldest daily
    branch, rename it to indicate that it's now a weekly snapshot.
    (That will implicitly do the first part of step 2, since the
    next daily branch in line will already be a descendant of the
    newly renamed branch.)

    Repeat step 2, rebasing against the latest monthly branch,
    to expire the oldest weekly.

(4) To expire an old monthly, rebase the second-oldest monthly branch
    under the initial empty revision, then delete the oldest monthly.
    This is basically step 2 again, but rebasing under a fixed starting
    point.

(5) Run git-prune to expire the objects in the deleted branches, then
    git-repack -a -d to delta-compress everything.

That's a bit convoluted, admittedly, and probably a perversion of 
everything ...
To: Steven Grimm <koreth@...>
Cc: <git@...>
Date: Thursday, December 14, 2006 - 8:33 pm

You can keep them connected by rewriting history of bounded
number of commits.  When you start a new week, you would make
the Monday commit a child of the tip of weekly branch that
represents the latest weekly snapshot.  Then on Friday, the
history would show the 5 commits during the week and behind that
would be a sequence of commits with one-per-week granularity.
When you rotate the week's daily log out and the commit for
Monday is based on the weekly history you are going to toss out,
you may need to rebase that week's daily log branch.

Let's say your policy is to keep daily log for at least one week
and enough number of end-of-week weekly logs.  Let's say it is
week #2 right now.

                        Aooo... (week #2 daily)
                       /|
                ooooooB |  (week #1 daily)
               /        |
     o--------o---------C (end-of-week weekly log)

The first commit in this week's daily log (A) would have two
parents: last commit from daily log of week #1 (B), and the
latest commit on the end-of-week weekly log (C).  Most likely, B
and C would have exactly the same tree.  That way, you would
have at least 7 days of daily log; at the end of this week you
would have close to 14 days but "keeping at least one week" is
satisfied.
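
Creating a commit like A, with both B and C as parents, might look like the sketch below. The function name and commit message are made up, and a committer identity is assumed to be configured.

```shell
#!/bin/sh
# Sketch only: start_week_commit is a made-up name.
# $1 = tree to record
# $2 = last daily commit of the previous week (B)
# $3 = tip of the end-of-week weekly log (C)
# Prints the id of the new two-parent commit (A).
start_week_commit() {
    echo "first daily snapshot of the week" |
        git commit-tree "$1" -p "$2" -p "$3"
}
```

Because a commit id depends on its parents, dropping B later means re-creating A with only `-p C` and then re-creating every commit that sits on top of it.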

When starting the 3rd week, you will discard 1st week's log; you
would need to rewrite 7 days worth of commits from week #2,
because the first commit of week #2 should now only have one
parent (C), and you would forget the commit on the last day of
week #1 as its parent (B).  Which cascades through 7 commits you
made during week #2.  You are not changing any trees, so this
should be quite efficient.

Then the first daily commit of 3rd week would have two parents,
the commit at the end of week #2 daily branch (D), and a new
commit (E) at the tip of the end-of-week log.  Again, D and E
would have the identical trees.

                                o...... (week #3 daily)
                               /|
              ...
To: Steven Grimm <koreth@...>
Cc: <git@...>
Date: Tuesday, December 12, 2006 - 7:15 pm

Steven,

I've been thinking myself of writing a pdumpfs lookalike that uses git
internally. Sounds like you've got one already ;-)

In terms of getting rid of old history, have you considered moving a
graft point "forward" in time, and running git-repack -a -d? With your
history being (mostly?) linear this could be a workable scheme, but I
don't have much practice with using grafts.
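
For illustration, a graft that makes a commit look parentless is a line in the grafts file holding just that commit's id. A rough sketch follows; the function name is made up, and newer git versions deprecate the grafts file in favour of `git replace`.

```shell
#!/bin/sh
# Sketch only: graft_cut is a made-up name. Run inside the repository.

# Record commit $1 in the grafts file with no parents listed, so that
# history appears to start there.
graft_cut() {
    commit=$(git rev-parse --verify "$1^{commit}") || return 1
    grafts=$(git rev-parse --git-dir)/info/grafts
    mkdir -p "$(dirname "$grafts")"
    echo "$commit" >> "$grafts"
}
```

Moving the cut "forward" would mean replacing that line with a newer commit id before repacking. Whether the older objects then actually get dropped should be verified on a copy of the repository first.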

cheers,


martin
-
To: Steven Grimm <koreth@...>
Cc: <git@...>
Date: Tuesday, December 12, 2006 - 7:23 pm

Actually - what I was considering was mixing the "daily commit" with
GITFS ;-) http://www.sfgoth.com/~mitch/linux/gitfs/

are your scripts published anywhere?

cheers,


martin
-
To: Steven Grimm <koreth@...>
Cc: <git@...>
Date: Tuesday, December 12, 2006 - 6:57 pm

Hi,


Almost!

$ git pull --depth 1

Though it needs a server _and_ a client supporting shallow clones; that 
support is brewing in "next" right now.

Ciao,
Dscho

-
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: <git@...>
Date: Tuesday, December 12, 2006 - 7:06 pm

Will that actually discard old revisions that are already stored locally?

-Steve

-
To: Steven Grimm <koreth@...>
Cc: <git@...>
Date: Tuesday, December 12, 2006 - 8:01 pm

Hi,


No. A pull should _never_ lose anything from the repository. However, if 
some objects become no-longer reachable (and at the moment it looks like 
we cut off history, even if we should not need to), they can be pruned from 
the repo.

Hth,
Dscho

-
To: Kyle Moffett <mrmacman_g4@...>
Cc: <git@...>
Date: Monday, December 11, 2006 - 6:50 am

I keep the files I want to track in a separate folder that I track
with Git and use a Makefile for updating /etc.  I basically have a
rule for checking for differences between the tracked folder and /etc
and a rule for installing changed files (with the correct
permissions).  It works, but it does require some "Makefile magic" to
work right (or the way /I/ want it anyway).
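
The same check-then-install idea, written as a plain shell function rather than Makefile rules, might look like the sketch below. It assumes GNU stat and install; the function name and argument order are made up.

```shell
#!/bin/sh
# Sketch only: install_if_changed is a made-up name.
# $1 = tracked file, $2 = destination (e.g. under /etc),
# $3 = octal mode without a leading zero (e.g. 640).
install_if_changed() {
    src=$1 dst=$2 mode=$3
    # Skip the copy when both content and mode already match.
    if cmp -s "$src" "$dst" 2>/dev/null &&
       [ "$(stat -c %a "$dst")" = "$mode" ]; then
        return 0
    fi
    install -m "$mode" "$src" "$dst"
}
```

Leaving matching files untouched keeps their inode and timestamps stable, which matters if something else is watching /etc for changes.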

  nikolai
-
To: Kyle Moffett <mrmacman_g4@...>
Cc: <git@...>
Date: Sunday, December 10, 2006 - 11:06 am

I have not used it, but you could try:

http://www.isisetup.ch/

that uses git as a backend.

Santi
-
To: <git@...>
Date: Tuesday, January 9, 2007 - 9:39 pm

I want to have a tripwire-like system checking the files to make sure that they 
haven't changed unexpectedly. the program I'm looking at notices inode as well 
as timestamp and content changed.

when you checkout a file from git will it re-write/overwrite a file that hasn't 
changed or will it realize there is no change and leave it as-is?

does this answer change if there is a trigger on checkout (to change permissions 
or otherwise manipulate the file)?

David Lang
-
To: David Lang <david.lang@...>
Cc: <git@...>
Date: Tuesday, January 9, 2007 - 10:30 pm

If the stat data is current it will leave it as-is.  You can force
the index to refresh with `git update-index --refresh` or by running

Only if the trigger does something in addition, like force overwrite
files.  But we don't have a checkout trigger.  So there's no trigger.

-- 
Shawn.
-
To: Shawn O. Pearce <spearce@...>
Cc: <git@...>
Date: Wednesday, January 10, 2007 - 2:34 pm

I was looking at checkout, not checkin so I'm not understanding how the index is 

we don't have a checkout trigger? I thought that what Linus had suggested for 
permissions was to have a script triggered on checkin that stored the 
permissions of the files, and a script triggered on checkout that set the 
permissions from the stored file.

if there isn't a checkout trigger how would the permissions ever get set?

in my particular case I'd like to have the checkin run a script that produces a 
'generic' version of each file, and the checkout run a script that converts the 
generic version into the host specific version. I already have a script that 
does this work (and (ab)uses ssh to propagate the generic version to other hosts 
and create the host specific versions there), but I was interested in using git 
to add better version control to the generic versions of the files (I currently 
use RCS on each box to version control the host specific versions)

David Lang
-
To: David Lang <david.lang@...>
Cc: <git@...>
Date: Thursday, January 11, 2007 - 8:55 pm

During checkout we use the index to help us decide if a file needs
to be updated with new content or can be left as-is.  It's a cache of
what version each file is at, and its based on the file stat data
(dev, inode, modification date, etc.) to tell us if the file has
been modified or was last created by Git.  If Git was the one that
last modified the file and the version stored in the index matches
the version needed during the checkout, the file is left alone.
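
The idea of such a stat-based cache can be illustrated outside of git with a few lines of shell. This is only a sketch, assuming GNU stat. The index itself records more fields than are used here (ctime, uid, gid and mode among them), and the function names are made up.

```shell
#!/bin/sh
# Sketch only: stat_sig and unchanged are made-up names.

# Print a small signature for file $1: device, inode, mtime and size.
stat_sig() {
    stat -c '%d %i %Y %s' -- "$1"
}

# Succeed when file $1 still matches the previously saved signature $2.
unchanged() {
    [ "$(stat_sig "$1")" = "$2" ]
}
```

A file whose signature still matches can be skipped without reading its content. A file that was replaced by a copy gets a new inode, so the signature no longer matches and the content has to be re-read.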



Someone needs to implement support for a post-checkout trigger.  _Then_

You may be able to do that in the pre-commit hook by updating the index

-- 
Shawn.
-
To: <sbejar@...>, Jeff Garzik <jeff@...>
Cc: <git@...>
Date: Sunday, December 10, 2006 - 1:46 pm

On Dec 10, 2006, at 10:06:14, Santi B
To: Kyle Moffett <mrmacman_g4@...>
Cc: <git@...>
Date: Sunday, December 10, 2006 - 10:49 am

It's a great idea, something I would like to do, and something I've 
suggested before.  You could dig through the mailing list archives, if 
you're motivated.

I actively use git to version, store and distribute an exim mail 
configuration across six servers.  So far my solution has been a 'fix 
perms' script, or using the file perm checking capabilities of cfengine.

But it would be a lot better if git natively cared about ownership and 
permissions (presumably via an option).

	Jeff



-