Re: removing content from git history

Previous thread: Re: [PATCH 1.5.0.1.37] fix git-remote inconsistent about use of dots in remote names by Bart Trojanowski on Wednesday, February 21, 2007 - 3:04 am. (1 message)

Next thread: [PATCH] git-apply: notice "diff --git" patch again by Junio C Hamano on Wednesday, February 21, 2007 - 3:31 pm. (2 messages)
From: Michael Hendricks
Date: Wednesday, February 21, 2007 - 9:45 am

I assume that this question has already been addressed on the mailing
list, but I wasn't able to find anything about it in the archives.

Is it possible to remove content entirely from git's history?  I have a
client who does not use git for version control.  A couple months ago
they committed some sensitive client information which should never have
been committed.  Recently, they realized the mistake and now want to
remove all traces of the mistake from history.

I would like to migrate them to git at some point.  However, if they had
been using git for version control already, I'm not sure how I would
have solved this particular problem.  Any suggestions?

-- 
Michael
-

From: Shawn O. Pearce
Date: Wednesday, February 21, 2007 - 9:56 am

No, not once it has been published around to another repository.
Since every developer has a copy of the repository it's very difficult
to remove something, as it must be removed from every developer's
repository, and each developer must perform an action to agree to

The *only* way to do this in Git is to completely recreate every
commit after that point.  This changes all commit IDs and basically
forks the project into two completely different histories: the
one with the bad thing in it, and the one without the bad thing.
Users who have the bad thing will continue to have the bad thing
until they take explicit action to throw away all of that history
and switch to the other one.

Now this is actually not a huge deal if you do it on your local
repository and go "whoops, I should not have committed that".  If you
have not yet pushed the commit to another repository (and someone
has not yet fetched it from you either) you can use git-rebase to
discard it.  But once it's been pushed/fetched the genie is out of
the bottle, and it's not going back in.
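
For the unpublished case, a minimal sketch of dropping one bad commit with
git-rebase might look like the following.  The throwaway repository, the file
names and the commit messages are made up for illustration; the
"git rebase --onto" line is the part that does the work.

```shell
#!/bin/sh
set -e
# Illustrative identity so the commits can be created anywhere.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q

echo one > a.txt
git add a.txt
git commit -q -m "good commit 1"

echo "sensitive data" > secret.txt
git add secret.txt
git commit -q -m "oops"
bad=$(git rev-parse HEAD)

echo two > b.txt
git add b.txt
git commit -q -m "good commit 2"
old_tip=$(git rev-parse HEAD)

# Replay everything after $bad onto the parent of $bad, leaving $bad out.
git rebase -q --onto "$bad^" "$bad"

new_tip=$(git rev-parse HEAD)
remaining=$(git log --format=%s | grep -c oops || true)
echo "old tip: $old_tip"
echo "new tip: $new_tip"
```

Note that the tip gets a new commit ID, and that the dropped commit is still
reachable through the local reflog until that expires.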

-- 
Shawn.
-

From: J. Bruce Fields
Date: Wednesday, February 21, 2007 - 10:17 am

Also it can't have done any (non-fast-forward) merges since then.

Reconstructing history with a bunch of merges seems like something that
could be a huge pain.  (Though with some tools it might be doable.)

--b.
-

From: Linus Torvalds
Date: Wednesday, February 21, 2007 - 11:02 am

It's not actually that painful, but it *is* expensive.

I wrote git-convert-cache (now "git convert-objects") back when we did the 
SHA1/compression switchover changes and the date format translation, so 
we've actually had a tool that can do history rewriting pretty much since 
day 1 (well, "day 14", to be exact, but still.. April 2005).

BUT:

 - I'm not guaranteeing that it works any more. We haven't changed the 
   fundamental object format since, so that particular program has never 
   gotten any testing. It still compiles, but does it work? I dunno.

   I actually tested it on git itself. It converted the top of the git 
   tree successfully, and generated a *new* git history. Why? Because it 
   will actually rewrite the old git tree entries that have permission 
   0664 into 0644: the *data* will be identical (and no git tools except 
   for "git fsck --pedantic" will even notice the difference), but the 
   converted tree avoids one of the legacy decisions that we never fixed 
   in the git repository itself.

   So it works at least to *some* degree, but I would suggest you be very 
   very careful!

 - it can be slow. For something like git, which isn't *that* big, and 
   where we actually don't need to do a lot of rewriting (ie all the blobs 
   stay the same, and only a few trees have to be rewritten, and so it's 
   really just rewriting commits), it's not that bad. It actually 
   converted the whole git history in less than ten seconds for me.

   But if you have a *huge* tree, and you actually convert objects too 
   (say, you started using git on Windows before the "autocrlf" thing, and 
   want to convert the old blobs from CRLF -> LF), it would

    (a) require some extensions to convert-objects.c to do the blob 
        conversion
    (b) be *much* slower
    (c) generate tons of unpacked objects (because git-convert-objects 
        doesn't know to pack in between, and doesn't use anything 
        newfangled like "git-fast-import" to do ...
-

From: Linus Torvalds
Date: Wednesday, February 21, 2007 - 11:24 am

Side note: I wasn't entirely accurate. The kernel had trees with file mode 
0644 for all the early commits, because my umask is 0022. So everything up 
to commit 4bfa437cf1 is shared after the conversion.

But the next one (commit 5dfa9c1b4f) introduced the file 
include/asm-mips/vr41xx/pci.h with file mode 0664, and I'm not 100% sure 
why that one happened with that file mode, but as a result, every single 
commit ever after will have a different SHA1, because the tree got 
rewritten (and subsequent commits - even if their trees did *not* get 
rewritten - will obviously have different parent SHA1's).

So 56 commits are shared, and "only" 49276 commits were rewritten (and 
apparently 245 trees).
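
The knock-on effect on descendants can be reproduced in a scratch repository.
This is only a sketch: the repository, file names and messages are invented,
and it changes the root commit's message rather than a tree mode, but the
mechanism is the same.  Once one commit changes, every later commit gets a new
ID through its parent line even though its tree is untouched.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Fixed timestamps, so commit IDs depend only on content and parents.
export GIT_AUTHOR_DATE='1172056800 +0000' GIT_COMMITTER_DATE='1172056800 +0000'

repo=$(mktemp -d)
cd "$repo"
git init -q
for name in a b c; do
    echo "$name" > "$name.txt"
    git add "$name.txt"
    git commit -q -m "add $name"
done
old_tip=$(git rev-parse HEAD)
t1=$(git rev-parse 'HEAD~2^{tree}')
t2=$(git rev-parse 'HEAD~1^{tree}')
t3=$(git rev-parse 'HEAD^{tree}')

# Identical inputs give identical IDs: the name is a function of the content.
same_a=$(git commit-tree -m "add a" "$t1")
same_b=$(git commit-tree -m "add a" "$t1")

# Rewrite only the root commit, then rebuild its descendants on top of it
# using their original, unmodified trees.
new1=$(git commit-tree -m "add a (rewritten)" "$t1")
new2=$(git commit-tree -m "add b" -p "$new1" "$t2")
new_tip=$(git commit-tree -m "add c" -p "$new2" "$t3")
new_tree=$(git rev-parse "$new_tip^{tree}")

echo "old tip: $old_tip"
echo "new tip: $new_tip"
```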

			Linus
-

From: Shawn O. Pearce
Date: Wednesday, February 21, 2007 - 2:00 pm

One idea Junio and I kicked around on #git a short while ago
was to arrange for a pipe between the current Git process
and git-fast-import, where the pipe was used from within
write_sha1_file() rather than creating the loose object.

This way an existing process like git-apply or git-convert-objects
could easily spew hundreds of thousands of objects without needing
to worry about repacking in the middle; nor would we need to worry
about the complexity of trying to disentangle the multiobject packing
parts of fast-import into some sort of library.

Obviously this is only a good idea if we are going to be making
enough objects to warrant using a packfile; small bursts of 10-20
objects from a git-apply don't really justify a packfile.
But applying 100s of patches in a row might, if we could keep them
all fed through the same git-fast-import backend (and thus into
the same packfile).

-- 
Shawn.
-

From: Linus Torvalds
Date: Wednesday, February 21, 2007 - 2:11 pm

The problem there is that most conversion scripts that use 
"write_sha1_file()" will want to *read* that file later. If 
git-fast-import hasn't generated the pack yet (because it's still waiting 
for more data), that will not work at all.

So then you basically force the conversion script to keep remembering all 
the old object data (using something like pretend_sha1_file), or you limit 
it to things that just always re-write the whole object and never need any 
old object references that they might have written.

A lot of conversions tend to be incremental, ie they will depend on the 
data they converted previously.

			Linus
-

From: Shawn O. Pearce
Date: Wednesday, February 21, 2007 - 2:21 pm

Which is why I was actually thinking of flipping this on its head.
Libify git-apply and embed that into fast-import, then one of the
native input formats might just be an mbox, or something close enough
that a simple C/perl/sed prefilter could make an mbox into the input.

fast-import can (and does if necessary) go back to access the
packfile it is writing.  It has the index data held in memory and
uses only OBJ_OFS_DELTA so that sha1_file.c can unpack deltas just
fine, even though we lack an index file and have not completely
checksummed the pack itself.

So although no other Git process can use the packfile, it is usable
from within fast-import...

-- 
Shawn.
-

From: Bill Lear
Date: Tuesday, October 9, 2007 - 1:58 pm

I'm resurrecting this old thread, as we have come across a similar need and
I could not tell if this has been settled.  More below...


As I understand this thread, it does not appear that a resolution
was reached.  Our company has content in our central git repository
that we need to remove per a contractual obligation.  I believe the
content in question is limited to one sub-directory, that has existed
since (or near to) the beginning of the repo, if that matters.  We
obviously would just like to issue a "git nuke" operation and be done
with it, if that is available.  Barring that, we could probably follow
reasonably simple steps to purge the content and rebuild the repo.

So, what options do we have at present?


Bill
-

From: J. Bruce Fields
Date: Tuesday, October 9, 2007 - 2:02 pm

Have you looked at git-filter-branch in a recent version of git?  The
man page has some good examples.

--b.
-

From: Bill Lear
Date: Tuesday, October 9, 2007 - 3:25 pm

Ah, no, though I will do so.  It is apparently not in the version
I have (1.5.2.4), but it is in 1.5.3.1.  We'll give this a shot
and complain if we can't handle it.

Thank you.


Bill
-

From: Johannes Schindelin
Date: Wednesday, October 10, 2007 - 7:41 am

Hi,


git filter-branch.  I suggest using the index filter.  There is even a 
nice example in the man page of git filter-branch.
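
A minimal sketch of the index-filter approach, assuming the content to purge
lives in a single directory.  The name "secret", the scratch repository and
the commits are all illustrative.  Newer Git releases print a warning and
pause before running filter-branch, so the sketch sets
FILTER_BRANCH_SQUELCH_WARNING to skip that.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir secret
echo public > README
echo "sensitive data" > secret/key.txt
git add .
git commit -q -m "initial import"
echo more >> README
git commit -q -am "update README"

# Rewrite every commit on every ref, dropping secret/ from each index.
# --ignore-unmatch keeps the filter from failing on commits without it.
git filter-branch --index-filter \
    'git rm -r -q --cached --ignore-unmatch secret' -- --all >/dev/null 2>&1

# The rewritten history no longer mentions the directory ...
touching=$(git log --format=%H HEAD -- secret | wc -l | tr -d ' ')
in_tree=$(git ls-tree -r --name-only HEAD | grep -c '^secret/' || true)
# ... but filter-branch keeps the old tips under refs/original/.
backups=$(git for-each-ref --format='%(refname)' refs/original | wc -l | tr -d ' ')
echo "commits touching secret/: $touching, backup refs: $backups"
```

The backup refs (and the reflogs) still point at the old history, so the
objects are not actually gone from this repository until those are removed
and the repository is pruned, or until the rewritten branch is fetched into
a fresh repository.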

Which reminds me that I have some TODOs left in filter-branch...

Ciao,
Dscho

-

From: Linus Torvalds
Date: Wednesday, February 21, 2007 - 10:14 am

It's been discussed.

There are two options for doing it:

 - rewriting history. There are a few tools for this already, and for 
   specific needs it would be fairly easy to resurrect git-convert-objects 
   to do it for any kind of object.

   See "cg-admin-rewritehist" from cogito for an example of a tool that 
   would do what you need done. In fact, it has this exact thing as the 
   first example.

   (Btw, I think cg-admin-rewritehist is one of the few things that cogito 
   had that was really a good idea. Not that people probably _used_ it 
   much, but it's something that makes sense in the plumbing)

 - explicit support for "missing objects". We don't do it right now, but 
   we could add it. It was discussed for things like limited history etc 
   (the "shallow clone" kind of thing, before people actually added 
   shallow clones), and it would support the notion of "we export all our 
   history, but for internal reasons we cannot make certain objects 
   available" kinds of workflows.

So right now, rewriting history is an option that you can do. It will 
effectively create a totally new branch (which you can then make into a 
new repository) which has nothing in common with the old branch from the 
point where it was modified. So you can never really merge the two ever 
again, and you need to make sure that everybody who had the old repo 
contents will destroy it.

But at least in theory, it wouldn't be impossible to extend on the 
".git/grafts" kind of setup to say "this object has been consciously 
deleted", and that could in some circumstances be a better model. The 
biggest headache there would be the need to extend the native git protocol 
with a way to add such objects.

			Linus
-

From: Nicolas Pitre
Date: Wednesday, February 21, 2007 - 11:02 am

I think that would be a big security issue.  Right now the GIT history 
can be validated and more importantly trusted from a single commit 
signature.  If poking holes in that model is allowed by the graft 
mechanism, it must remain a local thing and a very conscious one 
otherwise the GIT trust model would be greatly weakened.

If your goal is to remove content from a repository then the only 
sensible way is to rewrite history before publishing.  It is pointless 
to add mechanisms to remove content after it has been distributed.


Nicolas
-

From: Linus Torvalds
Date: Wednesday, February 21, 2007 - 11:13 am

I'm not entirely in disagreement, but I can see the model where some 
company wants to make their work available (with the same history as their 
own internal stuff), but doesn't want to make a single file available for 
some reason.

So they'd have an external thing that just has the file excised.

Now, arguably, it's a lot better to use a "supermodule" approach for 
something like this: have two separate git trees, publish the public one, 
and use an internal supermodule that ties the public and internal trees 
together.

So supermodules might be a way to solve it in a better (and safer - the 
"remove objects from the public tree" thing is very error prone, since if 
you *ever* expose the object by mistake, it's now public) way. But I don't 
think the "filter out objects" thing is necessarily fundamentally flawed 
as an approach.

			Linus
-

From: Nicolas Pitre
Date: Wednesday, February 21, 2007 - 11:39 am

Well if you really wanted to do such a thing then you could use a new 
object type that only serves as a stub pretending to be another object 
whose SHA1 would have been xyz.  When referenced this object would 
generate a warning indicating to the user that given object has been 
excised out, but otherwise the whole reachability validation would still 
work as usual.

And since this object would be distributed through standard mechanisms 
then there would be no need for protocol extensions.

I don't know if this could help creating SHA1 collisions though.  We've 
dismissed them as highly improbable because the likelihood of a 
collision to hide compromised material would most probably require a 
binary blob somewhere to balance the hash and would hardly be 
compilable/undetected.  But object stubs, with their ability to 
pretend to have any possible SHA1, are in fact a nice way to hide 20-byte 
binary blobs in the hash chain, possibly making it "easier" to create 
"useful" collisions.  This is where I see a weakening of the trust 
model.


Nicolas
-

From: Michael Hendricks
Date: Wednesday, February 21, 2007 - 11:30 am

What's a decent way to make a branch into a new repository?  My first
inclination is to "cp -a" the existing repository, checkout the branch,
delete all other branches and repack.  That seems to have worked in my
quick test, but is there a better way?

-- 
Michael
-

From: Shawn O. Pearce
Date: Wednesday, February 21, 2007 - 11:37 am

Don't "cp -a" the repository, use git-clone.

And actually, if you just want to pull one branch out into its
own repository you can do something like this:

	mkdir ../theonebranch
	cd ../theonebranch
	git init
	git fetch ../oldstuff theonebranch:master

and you have just the content of `theonebranch` from ../oldstuff
stored here, as master.

Optionally if you now want to actually see the files, you would do:

	git checkout

-- 
Shawn.
-

From: Linus Torvalds
Date: Wednesday, February 21, 2007 - 11:47 am

That works.

As does just "clone repo, delete all unwanted branches, and prune" (of 
course, if you don't want the old repo, you can skip the "clone" part, and 
just do the "delete all unwanted branches and prune" thing).

In some ways, a more straightforward approach may be to just create a new 
repo, and populate it with just one branch (I say "more straightforward", 
not "easier", because I just think it's conceptually simpler):

	mkdir new-repo
	cd new-repo
	git init
	git pull old-repo <branch>

(add "--bare" and "--shared" to taste - with bare repos you can also do it 
the other way by doing a push into it from outside after you've created 
it, which can be the "logical" way to do it if you want to just publish 
the end result on some shared site)
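
The push-into-a-bare-repo variant could be sketched like this.  The branch
name "clean", the paths and the commits are placeholders, not anything from
the thread.  Only objects reachable from the pushed branch end up in the
published repository.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

old=$(mktemp -d)
pub=$(mktemp -d)/published.git

cd "$old"
git init -q
echo public > README
git add README
git commit -q -m "public work"
git branch clean                  # the branch we are willing to publish

echo "sensitive data" > secret.txt
git add secret.txt
git commit -q -m "internal only"  # stays on the original branch
blob=$(git rev-parse HEAD:secret.txt)

# Create the empty bare repository and push just the one branch into
# whatever branch its HEAD points at.
git init -q --bare "$pub"
target=$(git --git-dir="$pub" symbolic-ref HEAD)
git push -q "$pub" "clean:$target"

published_tip=$(git --git-dir="$pub" rev-parse HEAD)
clean_tip=$(git rev-parse clean)
if git --git-dir="$pub" cat-file -e "$blob" 2>/dev/null; then
    secret_published=yes
else
    secret_published=no
fi
echo "secret blob in published repo: $secret_published"
```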

		Linus
-

From: Linus Torvalds
Date: Wednesday, February 21, 2007 - 11:56 am

Btw, when I say "works", I do mean that "yeah, 'cp -a' works, but 
generally you're better off cloning".

When you use 'cp -a' you have to re-build the index at the very least. It 
so happens that since you checked out the branch explicitly, that will do 
it for you anyway, but it's still often a good idea to just *not* use the 
regular "copy everything by hand" approach.

If you want to be really efficient, there are actually better ways. For 
example (since you want to avoid having any of the old objects even 
reachable by mistake), you're probably better off with an explicit pull of 
the explicit branch, if only because that also involves a re-pack of only 
the reachable objects, and you know that there won't be any reflogs etc 
that might still make the object you try to remove be accessible to people 
who can access the resulting repository directly.

(Yeah, the "cp -a" is faster than the "git pull", but since you want to do 
the packing that git pull does for you *anyway* to get rid of the old 
objects, "git pull" actually ends up being better).

			Linus
-

From: Nicolas Pitre
Date: Wednesday, February 21, 2007 - 11:52 am

Like Shawn said the better way is simply to fetch that branch into a new 
repo.

If you do a cp -a and delete unwanted branches it'll work as well of 
course, but repacking won't get rid of all the data from the believed to 
be deleted branches since some reflog, the HEAD reflog in particular, 
will most probably have references to commits from the removed branches. 
Therefore the pack will still contain that data, at least until the 
reflog entries expire and get pruned.
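
To see the reflog effect, and one way around it when you do work on a copy,
here is a small sketch.  The repository, the branch name "unwanted" and the
file contents are invented for the demonstration.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q
echo public > README
git add README
git commit -q -m "public work"
keep=$(git symbolic-ref --short HEAD)

git checkout -q -b unwanted
echo "sensitive data" > secret.txt
git add secret.txt
git commit -q -m "should never have been committed"
blob=$(git rev-parse HEAD:secret.txt)

git checkout -q "$keep"
git branch -q -D unwanted

# Report whether an object is still present in the object database.
exists() { git cat-file -e "$1" 2>/dev/null && echo present || echo gone; }

# A plain repack keeps the blob: the HEAD reflog still references it.
git gc -q
after_plain_gc=$(exists "$blob")

# Drop all reflog entries, then prune unreachable objects immediately.
git reflog expire --expire=now --all
git gc -q --prune=now
after_expire=$(exists "$blob")

echo "after plain gc: $after_plain_gc, after expiring reflogs: $after_expire"
```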

Of course if you want to publish just the wanted branch and perform a 
push to a public place then only those objects for that branch will be 
sent like for the fetch case.


Nicolas
-

From: Junio C Hamano
Date: Wednesday, February 21, 2007 - 12:01 pm

While I agree in principle to the argument that there is no
taking it back what's already published, I've heard people
wanting to just stop distributing further, without worrying
about copies already out there.  'missing objects' support would
help us in such a situation.

Supporting 'missing objects' in general would be painful, when
they contain pointers to other objects (i.e. tags, commits, and
trees).

Thinking aloud...

 * missing blob: we can have 'stub blob' objects.  Probably the
   object header for such an object would look like:

	stub <length> NUL
	-----------------
        object <object name of the real blob object>
        type blob

   Hashing a 'stub' object (along with its header as usual, in
   write_sha1_file_prepare()) would instead just report the
   object name recorded there.

   When packing (this applies both to local repacking and
   push/fetch object transfer to other repositories), the stub
   object is included.  The delta algorithm would probably not try to
   delta other objects against it.

 * missing commit and tag: 'stub object' needs to be extended to
   include these object types, and we would also need 'stub
   commit' and 'stub tag' objects, that copy the structural
   fields from the corresponding true object.  So a stub commit
   would probably look like:

	stub <length> NUL
	-----------------
        object <object name of the real commit object>
        type commit
        tree <object name of the tree contained in the real commit object>
        parent <object name of the first parent in the real commit object>
        parent <object name of the second parent in the real commit object>

 * missing tree would only be useful to conceal pathnames
   recorded in the real tree object.  I am not sure if that is
   needed.

 * fsck and verify-pack needs to be taught about 'stub' objects,
   so that they know that their filenames (or the data pointed
   at by pack .idx) do not match the result of hashing them.
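
For context, this is the normal rule a 'stub' object would have to bypass: an
object's name is the hash of its header plus its content.  A rough sketch for
a blob follows, assuming a SHA-1 repository format and that sha1sum is
available; the content string is arbitrary.

```shell
#!/bin/sh
set -e
content='sensitive data'

# The name Git itself computes for a blob with this content.
git_name=$(printf '%s' "$content" | git hash-object --stdin)

# The same thing by hand: SHA-1 over "blob <length>", a NUL byte, then the data.
manual_name=$(printf 'blob %d\0%s' "${#content}" "$content" | sha1sum | cut -d' ' -f1)

echo "$git_name"
echo "$manual_name"
```

A stub would be the one object kind whose stored bytes do not hash to its own
name, which is why fsck and verify-pack would need to know about it.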

If we were to do this, ...
-

From: Nicolas Pitre
Date: Wednesday, February 21, 2007 - 12:33 pm

I still think this is a "put your head in the sand and pretend that some 
sensitive data never existed in the wild" attitude.  And I really don't 
see the point of supporting that illusion in GIT with technical means.

Either you care about published data or you don't.

If you do then you are screwed anyway irrespective of any missing object 
support we might implement.  There will always be someone somewhere with 
the real thing, and we all know how fast forbidden material travels 
on the Internet.

If you don't then it is just better to rewrite history and have a clean 
and unambiguous repository.  And because you don't care about existing 
copies you shouldn't bother with the fact that the rewritten repo is not 
compatible with the previously published one.

Sure rewriting history is a potentially expensive operation depending on 
the size and nature of the change, but it is done only once.  And 
actually it can't be _that_ much more expensive than a git-repack -a -f.

I think it is much better to provide a tool to properly rewrite history 
than adding support for missing objects and be stuck with them forever.


Nicolas
-

From: Junio C Hamano
Date: Wednesday, February 21, 2007 - 1:22 pm

Well, I think we are in agreement (and that is why I said "I've
heard people wanting").

But it is entirely possible that somebody has a project that is
internal to a company managed for a long time with git, that he
wants to go open source, with (almost) full history.  And the
project may have some proprietary add-on bit which cannot be
published, while building the public bits does not require that
part.  Stubbing things out may help that kind of situation.  The
development team can keep going forward, internally using the
real objects, while pushing stub objects out to the public
repository, without having to rewrite the history and re-partition
the project.

But after having thought about that, I think it would not buy us
much.  You would want to re-partition the project sooner or
later in such a situation *anyway*, so our time is better spent
on giving better support to split existing projects.  It may
already be sufficient in the form of admin-rewritehist, in which
case we can worry about other things ;-).




-

From: Nicolas Pitre
Date: Wednesday, February 21, 2007 - 1:49 pm

It might help, or it might create a management nightmare.  It would be 
really easy to accidentally push the real objects out since a repo with 
them would be indistinguishable from a repo with stubs (that's the 
point of stub objects isn't it?), and because of the distributed nature 
of GIT the leak could come from anyone with access to the private 
objects.

In such a scenario I think it is still more sensible to rewrite the repo 
history before going open source.  You need only to worry about 
isolating the proprietary stuff once.


Nicolas
-
