I assume that this question has already been addressed on the mailing list, but I wasn't able to find anything about it in the archives. Is it possible to remove content entirely from git's history? I have a client who does not use git for version control. A couple months ago they committed some sensitive client information which should never have been committed. Recently, they realized the mistake and now want to remove all traces of the mistake from history. I would like to migrate them to git at some point. However, if they had been using git for version control already, I'm not sure how I would have solved this particular problem. Any suggestions? -- Michael -
No, not once it has been published around to another repository. Since every developer has a copy of the repository it's very difficult to remove something, as it must be removed from every developer's repository, and each developer must perform an action to agree to the removal. The *only* way to do this in Git is to completely recreate every commit after that point. This changes all commit IDs and basically forks the project into two completely different histories: the one with the bad thing in it, and the one without the bad thing. Users who have the bad thing will continue to have the bad thing until they take explicit action to throw away all of that history and switch to the other one. Now this is actually not a huge deal if you do it on your local repository and go "whoops, I should not have committed that". If you have not yet pushed the commit to another repository (and someone has not yet fetched it from you either) you can use git-rebase to discard it. But once it's been pushed/fetched the genie is out of the bottle, and it's not going back in. -- Shawn. -
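For the unpublished case, one way to discard a commit with git-rebase is sketched below. This is only an illustration under assumptions: the throwaway repository, the file names and the commit messages are made up, and it assumes the commits after the bad one do not depend on it. The discarded commit also stays reachable from the local reflog until that expires.

```shell
#!/bin/sh
# Sketch: drop one unpublished commit from the middle of a local branch.
# File names and commit messages are made up for the demonstration.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo

echo "one" > README
git add README
git commit -q -m "good: start"

echo "password" > creds
git add creds
git commit -q -m "bad: sensitive data"
bad=$(git rev-parse HEAD)

echo "two" >> README
git commit -q -a -m "good: more work"

# Replay everything after $bad onto the parent of $bad, skipping $bad itself.
git rebase -q --onto "$bad^" "$bad"

# Number of commits left on the branch, and how many of them touch "creds".
commits=$(git rev-list --count HEAD)
touching_creds=$(git log --pretty=format:%h HEAD -- creds | grep -c . || true)
```

The commit that followed the bad one gets a new ID after the rebase, which is harmless only because nobody else has fetched the old one.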
Also it can't have done any (non-fast-forward) merges since then. Reconstructing history with a bunch of merges seems like something that could be a huge pain. (Though with some tools it might be doable.) --b. -
It's not actually that painful, but it *is* expensive.
I wrote git-convert-cache (now "git convert-objects") back when we did the
SHA1/compression switchover changes and the date format translation, so
we've actually had a tool that can do history rewriting pretty much since
day 1 (well, "day 14", to be exact, but still.. April 2005).
BUT:
- I'm not guaranteeing that it works any more. We haven't changed the
fundamental object format since, so that particular program has never
gotten any testing. It still compiles, but does it work? I dunno.
I actually tested it on git itself. It converted the top of the git
tree successfully, and generated a *new* git history. Why? Because it
will actually rewrite the old git tree entries that have permission
0664 into 0644: the *data* will be identical (and no git tools except
for "git fsck --pedantic" will even notice the difference), but the
converted tree avoids one of the legacy decisions that we never fixed
in the git repository itself.
So it works at least to *some* degree, but I would suggest you be very
very careful!
- it can be slow. For something like git, which isn't *that* big, and
where we actually don't need to do a lot of rewriting (ie all the blobs
stay the same, and only a few trees have to be rewritten, and so it's
really just rewriting commits), it's not that bad. It actually
converted the whole git history in less than ten seconds for me.
But if you have a *huge* tree, and you actually convert objects too
(say, you started using git on Windows before the "autocrlf" thing, and
want to convert the old blobs from CRLF -> LF), it would
(a) require some extensions to convert-object.c to do the blob
conversion
(b) be *much* slower
(c) generate tons of unpacked objects (because git-convert-objects
doesn't know to pack in between, and doesn't use anything
newfangled like "git-fast-import" to do ...
Side note: I wasn't entirely accurate. The kernel had trees with file mode 0644 for all the early commits, because my umask is 0022. So everything up to commit 4bfa437cf1 is shared after the conversion. But the next one (commit 5dfa9c1b4f) introduced the file include/asm-mips/vr41xx/pci.h with file mode 0664, and I'm not 100% sure why that one happened with that file mode, but as a result, every single commit ever after will have a different SHA1, because the tree got rewritten (and subsequent commits - even if their trees did *not* get rewritten - will obviously have different parent SHA1's). So 56 commits are shared, and "only" 49276 commits were rewritten (and apparently 245 trees). Linus -
One idea Junio and I kicked around on #git a short while ago was to arrange for a pipe between the current Git process and git-fast-import, where the pipe was used from within write_sha1_file() rather than creating the loose object. This way an existing process like git-apply or git-convert-objects could easily spew hundreds of thousands of objects without needing to worry about repacking in the middle; nor would we need to worry about the complexity of trying to disentangle the multiobject packing parts of fast-import into some sort of library. Obviously this is only a good idea if we are going to be making enough objects to warrant using a packfile; small 10-20 bursts of objects from a git-apply don't really justify a packfile. But applying 100s of patches in a row might, if we could keep them all fed through the same git-fast-import backend (and thus into the same packfile). -- Shawn. -
The problem there is that most conversion scripts that use "write_sha1_file()" will want to *read* that file later. If git-fast-import hasn't generated the pack yet (because it's still waiting for more data), that will not work at all. So then you basically force the conversion script to keep remembering all the old object data (using something like pretend_sha1_file), or you limit it to things that just always re-write the whole object and never need any old object references that they might have written. A lot of conversions tend to be incremental, ie they will depend on the data they converted previously. Linus -
Which is why I was actually thinking of flipping this on its head. Libify git-apply and embed that into fast-import, then one of the native input formats might just be an mbox, or something close enough that a simple C/perl/sed prefilter could make an mbox into the input. fast-import can (and does if necessary) go back to access the packfile it is writing. It has the index data held in memory and uses only OBJ_OFS_REF so that sha1_file.c can unpack deltas just fine, even though we lack an index file and have not completely checksummed the pack itself. So although no other Git process can use the packfile, it is usable from within fast-import... -- Shawn. -
I'm resurrecting this old thread, as we have come across a similar need and I could not tell if this has been settled. More below... As I understand this thread, it does not appear that a resolution was reached. Our company has content in our central git repository that we need to remove per a contractual obligation. I believe the content in question is limited to one sub-directory, that has existed since (or near to) the beginning of the repo, if that matters. We obviously would just like to issue a "git nuke" operation and be done with it, if that is available. Barring that, we could probably follow reasonably simple steps to purge the content and rebuild the repo. So, what options do we have at present? Bill -
Have you looked at git-filter-branch in a recent version of git? The man page has some good examples. --b. -
Ah, no, though I will do so. It is apparently not in the version I have (1.5.2.4), but it is in 1.5.3.1. We'll give this a shot and complain if we can't handle it. Thank you. Bill -
Hi, git filter-branch. I suggest using the index filter. There is even a nice example in the man page of git filter-branch. Which reminds me that I have some TODOs left in filter-branch... Ciao, Dscho -
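An index-filter run of the kind suggested here might look like the sketch below. It is modelled on the man page example rather than copied from it, and the repository and the directory name "secret-dir" are hypothetical. Recent git versions print a deprecation notice for filter-branch; the environment variable at the top only silences that notice.

```shell
#!/bin/sh
# Sketch: purge a directory from every commit with an index filter.
# The repository layout and the name "secret-dir" are hypothetical.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # only silences the notice in newer git
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo

mkdir secret-dir
echo "public" > README
echo "password" > secret-dir/creds
git add .
git commit -q -m "initial import"
echo "more" >> README
git commit -q -a -m "second commit"

before=$(git rev-parse HEAD)

# Rewrite every commit reachable from HEAD, dropping secret-dir from each tree.
# The filter edits only the index, so no files are checked out per commit.
git filter-branch -f --index-filter \
    'git rm -r -q --cached --ignore-unmatch secret-dir' HEAD >/dev/null 2>&1

after=$(git rev-parse HEAD)

# Count paths under secret-dir that still appear in the rewritten history.
leftover=$(git log --pretty=format: --name-only HEAD -- secret-dir | grep -c . || true)
```

The old commits are not gone at this point: filter-branch keeps the previous branch tip under refs/original/, and the reflog still points at it, so the objects stay in the repository until those references are removed and the repository is pruned.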
It's been discussed. There are two options for doing it:
- rewriting history. There are a few tools for this already, and for specific needs it would be fairly easy to resurrect git-convert-objects to do it for any kind of object. See "cg-admin-rewritehist" from cogito for an example of a tool that would do what you need done. In fact, it has this exact thing as the first example. (Btw, I think cg-admin-rewritehist is one of the few things that cogito had that was really a good idea. Not that people probably _used_ it much, but it's something that makes sense in the plumbing)
- explicit support for "missing objects". We don't do it right now, but we could add it. It was discussed for things like limited history etc (the "shallow clone" kind of thing, before people actually added shallow clones), and it would support the notion of "we export all our history, but for internal reasons we cannot make certain objects available" kinds of workflows.
So right now, rewriting history is an option that you can do. It will effectively create a totally new branch (which you can then make into a new repository) which has nothing in common with the old branch from the point where it was modified. So you can never really merge the two ever again, and you need to make sure that everybody who had the old repo contents will destroy it. But at least in theory, it wouldn't be impossible to extend on the ".git/grafts" kind of setup to say "this object has been consciously deleted", and that could in some circumstances be a better model. The biggest headache there would be the need to extend the native git protocol with a way to add such objects. Linus -
I think that would be a big security issue. Right now the GIT history can be validated and more importantly trusted from a single commit signature. If poking holes in that model is allowed by the graft mechanism, it must remain a local thing and a very conscious one, otherwise the GIT trust model would be greatly weakened. If your goal is to remove content from a repository then the only sensible way is to rewrite history before publishing. It is pointless to add mechanisms to remove content after it has been distributed. Nicolas -
I'm not entirely in disagreement, but I can see the model where some company wants to make their work available (with the same history as their own internal stuff), but doesn't want to make a single file available for some reason. So they'd have an external thing that just has the file excised. Now, arguably, it's a lot better to use a "supermodule" approach for something like this: have two separate git trees, publish the public one, and use an internal supermodule that ties the public and internal trees together. So supermodules might be a way to solve it in a better (and safer - the "remove objects from the public tree" thing is very error prone, since if you *ever* expose the object by mistake, it's now public) way. But I don't think the "filter out objects" thing is necessarily fundamentally flawed as an approach. Linus -
Well if you really wanted to do such a thing then you could use a new object type that only serves as a stub pretending to be another object whose SHA1 would have been xyz. When referenced this object would generate a warning indicating to the user that the given object has been excised, but otherwise the whole reachability validation would still work as usual. And since this object would be distributed through standard mechanisms then there would be no need for protocol extensions. I don't know if this could help creating SHA1 collisions though. We've dismissed them as highly improbable because a collision meant to hide compromised material would most probably require a binary blob somewhere to balance the hash and would hardly be compilable/undetected. But object stubs, with the ability to pretend to have any possible SHA1, are in fact a nice way to hide 20-byte binary blobs in the hash chain, possibly making it "easier" to create "useful" collisions. This is where I see a weakening of the trust model. Nicolas -
What's a decent way to make a branch into a new repository? My first inclination is to "cp -a" the existing repository, checkout the branch, delete all other branches and repack. That seems to have worked in my quick test, but is there a better way? -- Michael -
Don't "cp -a" the repository, use git-clone. And actually, if you just want to pull one branch out into its own repository you can do something like this:
    mkdir ../theonebranch
    cd ../theonebranch
    git init
    git fetch ../oldstuff theonebranch:master
and you have just the content of `theonebranch` from ../oldstuff stored here, as master. Optionally if you now want to actually see the files, you would do:
    git checkout
-- Shawn. -
That works. As does just "clone repo, delete all unwanted branches, and prune" (of course, if you don't want the old repo, you can skip the "clone" part, and just do the "delete all unwanted branches and prune" thing). In some ways, a more straightforward approach may be to just create a new repo, and populate it with just one branch (I say "more straightforward", not "easier", because I just think it's conceptually simpler):
    mkdir new-repo
    cd new-repo
    git init
    git pull old-repo <branch>
(add "--bare" and "--shared" to taste - with bare repos you can also do it the other way by doing a push into it from outside after you've created it, which can be the "logical" way to do it if you want to just publish the end result on some shared site) Linus -
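The "new repo populated with one branch" recipe can be tried end to end with the sketch below. The repository names, the branch names "keep" and "tainted", and the file "creds" are all hypothetical. It checks that a blob reachable only from the unwanted branch does not arrive in the new repository.

```shell
#!/bin/sh
# Sketch: copy one branch into a fresh repository and check that objects
# reachable only from other branches did not come along.
# All directory and branch names here are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Build an old repository with a clean branch and a branch holding a secret.
git init -q old-repo
cd old-repo
echo "public" > README
git add README
git commit -q -m "public work"
git branch keep
git checkout -q -b tainted
echo "password" > creds
git add creds
git commit -q -m "oops"
secret_blob=$(git rev-parse HEAD:creds)
cd ..

# New repository populated from the one branch we want.
git init -q new-repo
cd new-repo
git pull -q ../old-repo keep

# cat-file -e exits non-zero when the object is not in this repository.
if git cat-file -e "$secret_blob" 2>/dev/null; then
    secret_present=yes
else
    secret_present=no
fi
files=$(git ls-files)
```

Only objects reachable from the pulled branch are transferred, so the new repository starts without the other branches' objects and without the old reflogs.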
Btw, when I say "works", I do mean that "yeah, 'cp -a' works, but generally you're better off cloning". When you use 'cp -a' you have to re-build the index at the very least. It so happens that since you checked out the branch explicitly, that will do it for you anyway, but it's still often a good idea to just *not* use the regular "copy everything by hand" approach. If you want to be really efficient, there are actually better ways. For example, since you want to avoid having any of the old objects even reachable by mistake, you're probably better off with an explicit pull of the explicit branch, if only because that also involves a re-pack of only the reachable objects, and you know that there won't be any reflogs etc that might still make the object you try to remove be accessible to people who can access the resulting repository directly. (Yeah, the "cp -a" is faster than the "git pull", but since you want to do the packing that git pull does for you *anyway* to get rid of the old objects, "git pull" actually ends up being better). Linus -
Like Shawn said, the better way is simply to fetch that branch into a new repo. If you do a cp -a and delete unwanted branches it'll work as well of course, but repacking won't get rid of all the data from the believed-to-be-deleted branches, since some reflog, the HEAD reflog in particular, will most probably have references to commits from the removed branches. Therefore the pack will still contain that data, at least until the reflog entries expire and get pruned. Of course if you want to publish just the wanted branch and perform a push to a public place then only those objects for that branch will be sent, like for the fetch case. Nicolas -
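The reflog effect described here can be seen in a throwaway repository. The sketch below uses made-up branch and file names. It deletes a branch, repacks, and finds the blob still present because the HEAD reflog still references it. Only after the reflog entries are expired explicitly does a second gc drop it.

```shell
#!/bin/sh
# Sketch: a deleted branch's objects survive a repack until the reflog
# entries that mention them are expired. Names are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo

echo "public" > README
git add README
git commit -q -m "public work"
git branch keep
git checkout -q -b tainted
echo "password" > creds
git add creds
git commit -q -m "oops"
secret_blob=$(git rev-parse HEAD:creds)

# Leave the tainted branch and delete it; the HEAD reflog still records it.
git checkout -q keep
git branch -D tainted >/dev/null

git gc -q --prune=now
if git cat-file -e "$secret_blob" 2>/dev/null; then
    after_plain_gc=present
else
    after_plain_gc=gone
fi

# Expire every reflog entry right away, then repack and prune again.
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$secret_blob" 2>/dev/null; then
    after_reflog_expire=present
else
    after_reflog_expire=gone
fi
```

This only cleans the one repository it runs in; any clone or copy made earlier still has the objects.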
While I agree in principle to the argument that there is no
taking it back what's already published, I've heard people
wanting to just stop distributing further, without worrying
about copies already out there. 'missing objects' support would
help us in such a situation.
Supporting 'missing objects' in general would be painful, when
they contain pointers to other objects (i.e. tags, commits, and
trees).
Thinking aloud...
* missing blob: we can have 'stub blob' objects. Probably the
object header for such an object would look like:
    stub <length> NUL
    -----------------
    object <object name of the real blob object>
    type blob
Hashing a 'stub' object (along with its header as usual, in
write_sha1_file_prepare()) would instead just report the
object name recorded there.
When packing (this applies both to local repacking and
push/fetch object transfer to other repositories), the stub
object is included. The delta algorithm would probably not
delta other objects against it.
* missing commit and tag: 'stub object' needs to be extended to
include these object types, and we would also need 'stub
commit' and 'stub tag' objects, that copy the structural
fields from the corresponding true object. So a stub commit
would probably look like:
    stub <length> NUL
    -----------------
    object <object name of the real commit object>
    type commit
    tree <object name of the tree contained in the real commit object>
    parent <object name of the first parent in the real commit object>
    parent <object name of the second parent in the real commit object>
* missing tree would only be useful to conceal pathnames
recorded in the real tree object. I am not sure if that is
needed.
* fsck and verify-pack needs to be taught about 'stub' objects,
so that they know that their filenames (or the data pointed
at by pack .idx) do not match the result of hashing them.
If we were to do this, ...
I still think this is a "put your head in the sand and pretend that some sensitive data never existed in the wild" attitude. And I really don't see the point of supporting that illusion in GIT with technical means. Either you care about published data or you don't. If you do then you are screwed anyway irrespective of any missing object support we might implement. There will always be someone somewhere with the real thing, and we all know how fast forbidden material travels on the Internet. If you don't then it is just better to rewrite history and have a clean and unambiguous repository. And because you don't care about existing copies you shouldn't bother with the fact that the rewritten repo is not compatible with the previously published one. Sure rewriting history is a potentially expensive operation depending on the size and nature of the change, but it is done only once. And actually it can't be _that_ much more expensive than a git-repack -a -f. I think it is much better to provide a tool to properly rewrite history than adding support for missing objects and be stuck with them forever. Nicolas -
Well, I think we are in agreement (and that is why I said "I've heard people wanting"). But it is entirely possible that somebody has a project that is internal to a company managed for a long time with git, that he wants to go open source, with (almost) full history. And the project may have some proprietary add-on bit which cannot be published, while building the public bits does not require that part. Stubbing things out may help that kind of situation. The development team can keep going forward, internally using the real objects, while pushing stub objects out to the public repository, without having to rewrite the history and re-partition the project. But after having thought about that, I think it would not buy us much. You would want to re-partition the project sooner or later in such a situation *anyway*, so our time is better spent on giving better support to split existing projects. It may already be sufficient in the form of admin-rewritehist, in which case we can worry about other things ;-). -
It might help, or it might create a management nightmare. It would be really easy to accidentally push the real objects out since a repo with them would be indistinguishable from a repo with stubs (that's the point of stub objects isn't it?), and because of the distributed nature of GIT the leak could come from anyone with access to the private objects. In such a scenario I think it is still more sensible to rewrite the repo history before going open source. You need only to worry about isolating the proprietary stuff once. Nicolas -
