Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch)

Previous thread: Re: Converting to Git using svn-fe (Was: Speeding up the initial git-svn fetch) by Stephen Bash on Monday, October 18, 2010 - 5:40 pm. (1 message)

Next thread: git subcommand sigint gotcha by Joey Hess on Monday, October 18, 2010 - 9:53 pm. (10 messages)
From: Stephen Bash
Date: Monday, October 18, 2010 - 6:42 pm

To be precise: svn-fe creates commits where
  git diff-tree treeA treeB


I have 32 SVN revs in my history that touch multiple Git commit objects.  The simplest example is
  svn mv svn://svnrepo/branches/badBranchName svn://svnrepo/branches/goodBranchName




I'm glad it's stimulating conversation.  I'm beginning to wonder if there might be competing design goals for one-way vs. two-way compatibility...  Performance is one place where opinions probably greatly differ (I didn't mind taking an extra 30 minutes to mirror my SVN repo because it probably saved more than that in communication overhead later in the process, but that mirror operation is very taxing on your timeline); my exhaustive search of all SVN copies is another (I wanted to be *extremely* certain I knew about all the misplaced branches/tags, but it's inefficient for a casual developer who just wants to interact with an SVN server).  It's all just food for thought, and I'm happy to carry on the conversation from my different point-of-view :)

Thanks,
Stephen
--

From: Ramkumar Ramachandra
Date: Monday, October 18, 2010 - 11:42 pm

Hi Stephen,


Yep, they're certainly two different ways to approach the problem: I'd
be interested in investigating why it will produce different
results. Since we both agree that it's easier (and faster) to do it in
Git-land, I'm looking into the areas where it falls short.


Right, that IS expected behavior. Don't they correspond to separate
SVN revisions anyway? Why would you want to squash them?



Ouch! Thanks for the illustrative example- I understand now. We have
to bend backwards to perform a one-to-one mapping. It's finally struck
me- one-to-one mapping is nearly impossible to achieve, and I don't
know if it makes sense to strive for it anymore. Looks like Jonathan

Um, there's just one commit that deviates from the branch it's based
on (but you don't know that, and I should have been clearer): look at
contrib/svn-fe/svn-filter-root.py

It's just a minimalistic mapper, but it's fast and done nicely. You


When I made this comment, I was thinking of the one-to-one mapping. It

Ok, I still don't get this part- why mirror at all? Can't all the
information be mined out of the in-memory tree that svn-fe builds
while parsing the dumpfile? From the SVN-side, all that's required is
a streaming dumpfile like the one that `svnrdump dump` produces.
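
As a rough illustration of mining copy information straight out of a
dump stream, here is a minimal Python sketch. It assumes the usual
dumpfile record layout (a block of "Key: value" headers, a blank line,
then Content-length bytes of body) and looks only at the
Revision-number, Node-path, Node-copyfrom-rev and Node-copyfrom-path
headers. The function names and the sample data are made up for this
example; this is not how svn-fe itself is implemented.

```python
import io


def iter_records(stream):
    """Yield the header dict of every record in a binary dump stream."""
    while True:
        line = stream.readline()
        if not line:
            return
        if line == b"\n":
            continue  # blank separator between records
        headers = {}
        while line not in (b"", b"\n"):
            key, _, value = line.rstrip(b"\n").partition(b": ")
            headers[key.decode()] = value.decode()
            line = stream.readline()
        # Skip the body (properties and/or file text) without parsing it.
        stream.read(int(headers.get("Content-length", "0")))
        yield headers


def find_copies(stream):
    """Return (revision, from_path, from_rev, to_path) for every copy."""
    rev = None
    copies = []
    for headers in iter_records(stream):
        if "Revision-number" in headers:
            rev = int(headers["Revision-number"])
        elif "Node-copyfrom-path" in headers:
            copies.append((rev,
                           headers["Node-copyfrom-path"],
                           int(headers["Node-copyfrom-rev"]),
                           headers["Node-path"]))
    return copies


# Hypothetical dump fragment for a branch rename: a copy plus a delete.
PROPS = b"PROPS-END\n"
SAMPLE = (
    b"SVN-fs-dump-format-version: 2\n\n"
    b"Revision-number: 5\nProp-content-length: %d\nContent-length: %d\n\n"
    % (len(PROPS), len(PROPS))
    + PROPS + b"\n"
    + b"Node-path: branches/goodBranchName\nNode-kind: dir\n"
      b"Node-action: add\nNode-copyfrom-rev: 4\n"
      b"Node-copyfrom-path: branches/badBranchName\n\n"
    + b"Node-path: branches/badBranchName\nNode-action: delete\n\n"
)

copies = find_copies(io.BytesIO(SAMPLE))
```

Because the reader only ever moves forward, the same loop would work on
a pipe, so no mirror is needed just to list the copies.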

-- Ram
--

From: Will Palmer
Date: Wednesday, October 20, 2010 - 1:39 am

It's been a while since I was involved in this discussion, so maybe the
design has changed by now, but I was under the impression that there
would be one "one-to-one" mapping branch (which would never be checked
out), containing the history of /, and that the "real" git branches,
tags, etc, would be based on the trees originally referenced by the root
checkout, with git-notes (or similar) being used to track the weirdness
in mappings. How does the "multiple branches touched in a single commit"
complicate anything other than the heuristics for automatic branch
detection (which I assume nobody is at the stage of talking about yet).

I suppose we wouldn't be talking, technically, about a one-to-one
mapping in that case, as we would be turning "one" svn revision into
"many" git branches, but in the conceptual sense of "one svn repository
equals one git repository", I don't see this as being impossible, or so
difficult that it shouldn't be striven-for.

Something else which is at least semi-common in svn is to treat a folder
both as a "directory" and a "branch", which the "checking out /" example
would just be an extreme example of. Think in terms of git branches
being a "view" of the history, with some mapper sitting between each
view and "root" checkout.

--

From: Jakub Narebski
Date: Wednesday, October 20, 2010 - 4:59 am

I think there might be a problem in that in git a commit is defined by
its parents and its final state, while a revision in Subversion is IIRC
defined by a change.  Isn't it?

-- 
Jakub Narebski
Poland
ShadeHawk on #git
--

From: Will Palmer
Date: Wednesday, October 20, 2010 - 6:42 am

A "change" is a delta between one state and another, so each revision is
dependent on those which came before it just as much as a git commit
is. An svn "revision" is a snapshot, regardless of how it is stored, ie,
the "svn stores changes, git stores snapshots" is an implementation
detail. It's a detail which makes a lot of things easier/faster in git
than they would be in svn, but a mere detail none the less.

The difference of course is that the "name" of an svn revision stays the
same even if aspects of that revision (for example, the commit message)
are changed, while the "name" of a git commit is dependent on everything
that makes up a commit. In git terms, changing a commit message is
considered to be history rewriting, whereas in svn terms it is merely
something which happens occasionally as part of a regularly maintained
repository.
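
To make the naming difference concrete, here is a minimal Python sketch
of how a commit name is derived, assuming the SHA-1 object format: the
name is the hash of a "commit <size>\0" header followed by the commit
text. The helper name and the identity strings are placeholders for
this example.

```python
import hashlib


def commit_id(tree, parents, author, committer, message):
    """Compute the name of a commit object from everything it contains."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + author, "committer " + committer, "", message]
    body = "\n".join(lines).encode()
    header = ("commit %d\0" % len(body)).encode()
    return hashlib.sha1(header + body).hexdigest()


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
WHO = "A U Thor <author@example.com> 1287400000 +0000"  # placeholder identity

before = commit_id(EMPTY_TREE, [], WHO, WHO, "Fix tyop\n")
after = commit_id(EMPTY_TREE, [], WHO, WHO, "Fix typo\n")
# A child records its parent's name, so the edit propagates downstream.
child_before = commit_id(EMPTY_TREE, [before], WHO, WHO, "Next change\n")
child_after = commit_id(EMPTY_TREE, [after], WHO, WHO, "Next change\n")
```

Editing only the log message yields a different name for that commit and
for every commit built on top of it, whereas the svn revision number
would stay the same.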

The git philosophy is ingrained in its object model: If you change
something which led to a state, you change the state itself. I don't
think there should be an attempt to work-around that philosophy when
talking to external repositories. That is to say: if a commit message
(or other revprop) in history changes, we want to treat it as if we were
recovering from an upstream rebase. Of course, a problem in that could
very well be "how would we know about it?", which is a good question,
but one not directly related to [revision+directory]<->[commit]
mappings, afaik ;)

--

From: Jakub Narebski
Date: Wednesday, October 20, 2010 - 1:44 pm

Thanks for the correction, and for explanation.

The problem with a one-to-one [SVN revision]<->[Git commit] mapping in the
situation of Subversion mishandling described by Stephen Bash persists,
though the problem is not because "svn stores changes, git stores
snapshots", but because of the widely different models of branches.


Subversion uses the inter-file branching model (Wikipedia says it was
"borrowed" from Perforce) to handle branches and tags.  It uses the "branches
are copies (folders)" paradigm, and technically it doesn't have a separate
namespace for branches but has projects, branches, and projects'
filesystem hierarchy mixed together; what part of a path is the branch name
is defined by convention only.  This model makes it easy to mess up
a repository (because there are no technological barriers to going
against conventions, like the all-branches change mentioned earlier, or
changing tags, or a reversed hierarchy of branches and projects).
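
A small Python sketch of what "defined by convention only" means in
practice: the same kind of path splits into (branch, path inside
branch) differently depending on which layout the converter is told to
assume. The pattern syntax ("*" matches exactly one path component) and
the function name are inventions for this example, not something
Subversion or svn-fe provides.

```python
def split_branch(path, layout):
    """Split path into (branch, rest) using the longest matching pattern.

    Returns None when the path lies outside every pattern in the layout.
    """
    parts = path.strip("/").split("/")
    best = None
    for pattern in layout:
        pat = pattern.strip("/").split("/")
        if len(parts) < len(pat):
            continue
        if all(p == "*" or p == c for p, c in zip(pat, parts)):
            if best is None or len(pat) > best:
                best = len(pat)
    if best is None:
        return None
    return "/".join(parts[:best]), "/".join(parts[best:])


STANDARD = ["trunk", "branches/*", "tags/*"]
PROJECTS_FIRST = ["*/trunk", "*/branches/*", "*/tags/*"]

a = split_branch("/branches/foo/src/main.c", STANDARD)
b = split_branch("projA/branches/foo/src/main.c", PROJECTS_FIRST)
c = split_branch("projA/branches/foo/src/main.c", STANDARD)
```

Nothing in the repository itself says which layout is the right one, so
a path that is a branch under one convention is just an unclassified
directory under another.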

Because (from what I understand) revisions in Subversion are whole
project all-branches snapshots, and because revision identifiers are
monotonically incrementing numbers, there is no inherent notion of
_parent_ of commit, like there is in Git.  (I think that was the reason
why merge tracking was absent from Subversion until version 1.5, and
why mergeinfo is per-file rather than per-commit/per-revision property).


In Git, commits store a snapshot of the top level of a project (contrary to
revisions in Subversion being a snapshot of the top level of the repository
tree, all branches and tags in it).  Each commit in Git also stores its parent
or parents.  Those commit-to-parent links make up a DAG (Directed Acyclic
Graph) of revisions.  Branches in Git reside in a separate namespace,
and are live pointers (like e.g. the top pointer in stack implementations)
to commits; the commit that a branch points to (the tip of the branch) marks
out a subset of the DAG of revisions: all ancestors of the given commit - this
forms a line of development i.e. a branch.
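
The model can be sketched in a few lines of Python: commits record only
their parents, a branch is just a name pointing at one commit, and the
"line of development" is whatever is reachable from that tip by
following parent links. The commit names and the toy graph are made up
for this example.

```python
def reachable(tip, parents):
    """Return the set of commits reachable from tip via parent links."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return seen


# Toy history: A <- B <- C <- M, with D branching off B and merged in M.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "M": ["C", "D"]}
branches = {"master": "M", "topic": "D"}  # branch name -> tip commit

topic_history = reachable(branches["topic"], parents)
master_history = reachable(branches["master"], parents)
```

Moving a branch means changing one entry in `branches`; no commit is
copied, which is the opposite of the "branches are copies" paradigm.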

What is important here is that commit is ...
--

From: mrevilgnome
Date: Wednesday, October 20, 2010 - 6:54 pm

I agree.  The repository that I'm interested in converting has
branches all over the place /sandbox/, /sandbox/<username>/*,
/stable/MAIN/*, /stable/Features/*,  /features/*, /branches/*, etc...
Because subversion didn't enforce the convention it was all too easy to
ignore when our questionable branching strategy was created.  Instead
of expecting sub-folders of a particular path to be a branch is there
something that we can key off of in the dumpfile?  Are copy operations
--

From: Jakub Narebski
Date: Thursday, October 21, 2010 - 1:16 am

Actually it shouldn't be that hard to implement, if it isn't already
implemented in svn-fe.

We don't need to have copy operations notated in some fashion; it should
be enough to tell svn-fe where the top directory of project is in 
repository tree hierarchy (e.g. that it is at /stable/MAIN/* at
revision 1).  svn-fe could then use the 'tree' movement detection that the
'subtree' merge strategy uses.

-- 
Jakub Narebski
Poland
--

From: Will Palmer
Date: Thursday, October 21, 2010 - 2:08 am

To clarify, I was saying that there is a "parent" of each SVN commit, in
the top-level sense. This can be easily converted into a "whole
repository" ("svnroot") tree in git. Of course, this isn't useful for
actual work, but it's a good middle-layer, from which other more-useful
things can be derived.

In terms of converting the svnroot git history into actual branches,
there are several options for mapping things. Ignoring merges for a
moment, we could (for example) notice when two trees (as in tree
objects) are very similar at some point in history, and decide that
those are probably branches. It's tedious, but still fairly simple, to
walk the history and build a new history consisting only of edits to a
subtree (even if the commit messages don't always make sense out of
context). It really doesn't matter one lick whether a single svn commit
touched multiple generated git commits.
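
A minimal Python sketch of that walk, under heavy simplification: each
svnroot revision is modelled as a flat {path: content id} snapshot, and
a per-subtree history gets a new entry whenever the content below that
subtree changes. The function names, the prefixes and the sample
revisions are assumptions for this example; parent links, copies and
merges are ignored entirely.

```python
def subtree(snapshot, prefix):
    """Entries of a whole-repository snapshot below prefix, prefix stripped."""
    p = prefix.rstrip("/") + "/"
    return {path[len(p):]: blob
            for path, blob in snapshot.items() if path.startswith(p)}


def split_history(revisions, prefixes):
    """Map each prefix to the list of (revnum, subtree) where it changed."""
    history = {p: [] for p in prefixes}
    for rev, snapshot in revisions:
        for p in prefixes:
            tree = subtree(snapshot, p)
            last = history[p][-1][1] if history[p] else {}
            if tree != last:
                history[p].append((rev, tree))
    return history


revisions = [
    (1, {"trunk/a": "blob1"}),
    (2, {"trunk/a": "blob1", "branches/x/a": "blob1"}),
    # One svn revision touching two branches at once:
    (3, {"trunk/a": "blob2", "branches/x/a": "blob3"}),
]
history = split_history(revisions, ["trunk", "branches/x"])
trunk_revs = [rev for rev, _ in history["trunk"]]
branch_revs = [rev for rev, _ in history["branches/x"]]
```

Revision 3 simply shows up in both lists, once per subtree, which is the
sense in which a single svn commit can map to several generated git
commits without anything breaking.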

Of course, "ignoring merges" is temporary and a total cop-out, but I
wouldn't for a moment pretend that converting svn branches into git


Also correct. One SVN commit would logically map to several git commits.
It's best to think in terms of:
([svn commit] + [svn path]) -> [git commit] (or git tag, if we can get

I'm not entirely familiar with the git replace mechanism, but wouldn't
that mean that repository git-A (cloned from SVN before the property
change) and repository git-B (cloned from SVN after the property change)
would be unable to merge with each-other?
In my mind, if it would be a rebase when it happens in git-land, it
should be a rebase when it happens in

Any sufficiently large SVN-tracked project will use all of SVN's
features, whether the maintainer remembers or not ;)

Certainly it could be a "few and far between" thing, which doesn't need
to be handled to get going / usable (especially since creating a fresh
clone is so much faster than with git-svn). I don't know the internals
of SVN beyond what was mentioned in the manual 5 or so years ago, but I
assume you'd need to ...
--

From: Jakub Narebski
Date: Thursday, October 21, 2010 - 8:52 am

"Whole repository hierarchy (svnroot) snapshots" are useless without
extra work; Git needs "whole project" snapshots for its commits.

But the whole long description of "branching" model in Subversion was
meant as intro for explanation why there can be mishandled commits
in Subversion, which make it impossible to have 1-to-1 SVN revision to

Actually as Stephen Bash wrote in his response creating branches in
Subversion generates 'copy' operations in svndump... we have to filter

We would have to ensure that commits in Git in branch 'foo' are the same
as history of 'project/branches/foo' subtree in svnroot in Subversion.
Otherwise we would either have different history in Git and in Subversion,

I don't think the most common "sane" Subversion merge case would be
difficult to translate into a merge commit in Git: the svn:mergeinfo
property would have common revisions for all affected files/directories.

The problem is that, just as it is possible to mishandle a commit as described
by Stephen Bash by creating an all-branches revision, it is also possible
to mishandle a merge in Subversion, creating a revision where different files
are merged from different branches: such a thing does not have an easy
translation to Git's commit-level rather than file-level merge tracking.
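
A rough Python sketch of how the two cases could be told apart, assuming
the svn:mergeinfo value is a list of "source-path:revision-ranges" lines
(ranges separated by commas, with an optional trailing "*" on a range).
It compares the property before and after a revision and reports which
sources gained revisions. The function names are made up for this
example, and mergeinfo inheritance and per-file properties are ignored.

```python
def parse_mergeinfo(value):
    """Parse an svn:mergeinfo value into {source path: set of revisions}."""
    result = {}
    for line in value.splitlines():
        if not line.strip():
            continue
        # Split on the last colon so that colons in paths are tolerated.
        source, _, ranges = line.rpartition(":")
        revs = set()
        for item in ranges.split(","):
            item = item.strip().rstrip("*")
            first, _, last = item.partition("-")
            revs.update(range(int(first), int(last or first) + 1))
        result[source] = revs
    return result


def newly_merged(old_value, new_value):
    """Return {source: revisions} added between two property values."""
    old, new = parse_mergeinfo(old_value), parse_mergeinfo(new_value)
    delta = {}
    for source, revs in new.items():
        added = revs - old.get(source, set())
        if added:
            delta[source] = added
    return delta


sane = newly_merged("/trunk:1-5", "/trunk:1-5\n/branches/foo:6-9,11*")
mixed = newly_merged("", "/branches/a:3\n/branches/b:4")
```

When exactly one source gains revisions, as in `sane`, that source is a
natural candidate for the second parent of a merge commit; a result like
`mixed` is the kind of revision that has no obvious commit-level
equivalent.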
 

If I remember correctly some of the discussion was whether there can truly
be an irrecoverable situation where a single SVN revision *must* be mapped into

Note that possibly changing svn:log, svn:author and svn:date revision
properties is a problem only when there is ongoing interaction
between a Subversion repository (or mirror) and a Git repository (or mirror).
There is no problem with this issue when doing a one-shot conversion.
 
The major problem is that svn:log etc. are _unversioned_ properties (see
http://svnbook.red-bean.com/en/1.5/svn.ref.properties.html), so I am not
sure if there is a way for Subversion server to tell that some svn:log
properties changed.  Perhaps there is a log, even if properties ...
--

From: Jonathan Nieder
Date: Thursday, October 21, 2010 - 9:16 am

There has been brief discussion of that possibility on the Subversion
list [1]:

 "What we might need is an RA call that has
  the server provide the N last revisions to have undergone revprop edits..."

I'm guessing that there is not such a log now but the developers might
be open to a patch adding such a log (for the sake of svnsync and

Yes, exactly.  In some cases, this "git replace" step would have to be
accomplished by a separate command (or even "by hand") to get the job
done:

 alice> git clone svn://svn.example.com/
 upstream> svnadmin propedit ...
 bob> git clone svn://svn.example.com/

In this situation, alice and bob have diverging histories, just as
if upstream had rewritten history (because, well, upstream has).

Now if alice fetches from bob and notices that, then she must do

 alice> git replace AA BB

(or its user-friendly equivalent, or a batch equivalent to search for
and handle cases like this). 


Exactly.  Well, one can mitigate the performance problems by running
"git filter-branch" every once in a while. :)

Regards,
Jonathan

[1] http://thread.gmane.org/gmane.comp.version-control.subversion.devel/122840/focus=122944
--

From: Ramkumar Ramachandra
Date: Wednesday, October 20, 2010 - 7:05 am

Hi Will,


Yep, and I'm to blame for that- sorry I didn't CC you earlier. I got
confused between "Tomas Carnecky" and "Will Palmer". To avoid this
confusion in future, I'd request everyone to display the names they
use on the list in the IRC whois information (unless it's a privacy

Yeah, that was my plan too originally, but I clearly haven't thought
about it enough. I'm currently noting down the various scenarios that
the others are quoting -- there are quite a few I hadn't thought about
earlier.

[...]

-- Ram
--
