I've been thinking about these for a while in the back of my
head, and thought it might be better to start writing them down.
Many of these issues involve the UI, which means the changes
cannot materialize without breaking existing uses, but if we know
in advance what we are aiming for, maybe we will find a smoother
path to get there.
* Core data structure
I consider the on-disk data structures and on-wire protocol we
currently use to be sane, and there is not much to fix. There are
certainly things to be enhanced (64-bit .idx offset, for
example), but I do not think there is anything that is
fundamentally broken and needs to be reworked.
I have the same feeling for in-core data structures in general,
except for a few issues.
The biggest one is that we use too many static (worse, function
scope static) variables that live for the life of the process,
which makes many things very nice and easy ("run-once and let
exit clean up the mess" mentality), but because of this it
becomes awkward to do certain things. Examples are:
- Multiple invocations of merge-bases (needs clearing the
marks left on commit objects by earlier traversal),
- Creating a new pack and immediately starting to use it inside
the process itself (prepare_packed_git() is call-once, and we
have hacks to cause it to re-read the packs in many places),
- Visiting more than one repository within one process
(many per-repository variables in sha1_file.c are static
variables and there is no "struct repository" that we can
re-initialize in one go),
- The object layer holds onto all parsed objects
indefinitely. Because the object store at the philosophical
level represents the global commit ancestry DAG, there is
no inherent reason to have more than one instance of
object.c::obj_hash even if we visit more than one
repository in a process, but if the two repositories are
unrelated, objects from the repository we were looking at
only waste memory after switching to a different
repository.
- The diffcore is not run-once but it is run-one-at-a-time.
This is easy to fix if needed, though.
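To make the "struct repository" point above a little more concrete,
here is a minimal sketch in C. None of these names exist in the
current code: struct repository, repo_init(), repo_prepare_packs()
and friends are made up for illustration, and the pack scan is
faked by passing in a count instead of reading a directory. The
only point is that state hanging off a struct can be reset or
instantiated twice, which function-scope statics cannot.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: per-repository state that would otherwise be static. */
struct repository {
    char *object_dir;    /* where loose objects and packs live */
    int packs_prepared;  /* has the pack list been scanned yet? */
    int nr_packs;        /* number of packs found by the last scan */
};

static void repo_init(struct repository *r, const char *object_dir)
{
    assert(r && object_dir);
    r->object_dir = malloc(strlen(object_dir) + 1);
    assert(r->object_dir);  /* sketch: allocation failure is fatal */
    strcpy(r->object_dir, object_dir);
    r->packs_prepared = 0;
    r->nr_packs = 0;
}

/* Drop everything in one go, so the struct can be re-initialized. */
static void repo_release(struct repository *r)
{
    free(r->object_dir);
    r->object_dir = NULL;
    r->packs_prepared = 0;
    r->nr_packs = 0;
}

/* Call-once scan; packs_on_disk stands in for a real directory read. */
static void repo_prepare_packs(struct repository *r, int packs_on_disk)
{
    if (r->packs_prepared)
        return;
    r->nr_packs = packs_on_disk;
    r->packs_prepared = 1;
}

/* Forget the cached list so that a freshly written pack is noticed. */
static void repo_reprepare_packs(struct repository *r, int packs_on_disk)
{
    r->packs_prepared = 0;
    repo_prepare_packs(r, packs_on_disk);
}
```

With something along these lines, two repositories can be open in
one process without stepping on each other, and "re-read the packs"
becomes an explicit operation on one of them rather than a hack.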
There are some other minor details but they are not as
fundamental. Examples are:
- The revision traversal is nicely done but one gripe I have is
that it is focused on painting commits into two (and only
two) classes: interesting and uninteresting. If we allowed
more than one (especially, an arbitrary number of) kinds of
interesting, answering questions like "which branches does
this commit belong to? which tagged versions is this commit
already included in?" would become easier and more efficient.
show-branch has machinery to do that for a handful of branches,
but it could be unified with the revision.c traversal machinery.
- We have at least three independent implementations of
pathspec match logic and two different semantics (one is
component-prefix match, the other is shell glob), and they
should be unified. You can say "git grep foo -- 't/t5*'" but
not "git diff otherbranch -- 't/t5*'".
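As a rough illustration of the "more than one kind of interesting"
idea in the first item above, the sketch below gives each tip its
own bit and propagates the bits down to parents, so that every
commit ends up knowing which tips can reach it. This is not how
revision.c works today; struct commit here is a toy with parent
indices into an array, and paint_reachability(), MAX_COMMITS and
MAX_PARENTS are names invented for the example.

```c
#include <assert.h>

#define MAX_COMMITS 16
#define MAX_PARENTS 2

/* Toy commit: parents are indices into the commit table, -1 if none. */
struct commit {
    int parent[MAX_PARENTS];
    unsigned flags;  /* bit i set: reachable from tips[i] */
};

/* Paint every commit with the set of tips that can reach it. */
static void paint_reachability(struct commit *commits, int nr,
                               const int *tips, int nr_tips)
{
    int stack[MAX_COMMITS];
    int on_stack[MAX_COMMITS] = { 0 };
    int top = 0, i;

    assert(nr <= MAX_COMMITS && nr_tips <= 16);
    for (i = 0; i < nr; i++)
        commits[i].flags = 0;  /* clear marks left by an earlier run */

    for (i = 0; i < nr_tips; i++) {
        int t = tips[i];
        commits[t].flags |= 1u << i;
        if (!on_stack[t]) {
            stack[top++] = t;
            on_stack[t] = 1;
        }
    }

    while (top) {
        int cur = stack[--top];
        on_stack[cur] = 0;
        for (i = 0; i < MAX_PARENTS; i++) {
            int p = commits[cur].parent[i];
            unsigned merged;
            if (p < 0)
                continue;
            merged = commits[p].flags | commits[cur].flags;
            if (merged == commits[p].flags)
                continue;  /* nothing new to pass down */
            commits[p].flags = merged;
            if (!on_stack[p]) {
                stack[top++] = p;
                on_stack[p] = 1;
            }
        }
    }
}
```

"Which branches does this commit belong to?" is then a matter of
reading the bits off commits[n].flags after one pass, instead of
running one interesting/uninteresting traversal per branch.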
* Fetch/Push/Pull/Merge confusion
Everybody hates the fact that the inverse of push is fetch, not
pull, and that merge is not a usual Porcelain (while it _is_
usable as a regular UI command, it was originally done as a
lower-layer helper for the "pull" Porcelain and has a strange
parameter order with a seemingly useless HEAD parameter in the
middle).
If I were doing git from scratch, I would probably avoid any of
the above words that have loaded meanings from other SCMs.
Perhaps...
- "git download" would download changes made in the other end
since we contacted them the last time and would not touch our
branches or working tree (associate the word with getting
tarballs -- people would not expect the act of downloading a
tarball to touch their working tree or local history;
untarring it does). Whether the end-user should be required to
explicitly say "download" is a different story; I am leaning
towards making it more or less transparent.
- "git upload" to upload our changes to the other end -- that
is what "git push" currently does.
- "git join" to merge another branch into the current branch,
with the "per branch configuration" frills to decide what the
default for "another branch" is based on what the current
branch is, etc.
* Less visible "remoteness" of remote branches
If I were doing git from scratch, I would probably have made
separate remotes _the_ only layout, except I might have opted to
make "remotes" even less visible and treat it as merely a
cache of "the branch tips and tags we saw when we connected over
the network to look at them the last time".
So "git branch --list $remote" might contact the remote over the
network or use the cached version. When you think about it, it is
not all that different from always contacting the remote end --
the remote end may have mirror propagation delays, and your
local instance of git caching and not contacting the remote all
the time introduces a similar delay on your end which is (1) not
a big deal, and (2) unlike the remote mirror delay, controllable
on your end. For example, you could force it to update the
cache by "git download $remote; git branch --list $remote".
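The "remotes as a cache" behaviour could be sketched roughly like
this. Everything here is hypothetical: struct remote_tips,
list_remote_branches() and the download hook are invented names,
and fake_download() merely stands in for a real network round
trip so that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache of what the remote looked like at last contact. */
struct remote_tips {
    const char *const *names;  /* branch names seen last time */
    size_t nr;
    int valid;                 /* 0 until we have contacted the remote */
    unsigned contacts;         /* number of (simulated) round trips */
};

/* Hypothetical transport hook: contacts the remote and refills the cache. */
typedef void (*download_fn)(struct remote_tips *out);

static const char *const fake_names[] = { "master", "next" };

/* Stand-in for the network: always reports the same two branches. */
static void fake_download(struct remote_tips *out)
{
    out->names = fake_names;
    out->nr = sizeof(fake_names) / sizeof(fake_names[0]);
    out->contacts++;
}

/* List branches from the cache, going to the network only when needed. */
static const struct remote_tips *
list_remote_branches(struct remote_tips *cache, download_fn download,
                     int force_refresh)
{
    if (force_refresh || !cache->valid) {
        download(cache);
        cache->valid = 1;
    }
    return cache;
}
```

Passing force_refresh corresponds to running "git download $remote"
first; without it, the listing is answered from whatever was seen
the last time.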
* Unified "fetch" and "push" across backends
I was rediscovering git-cvsimport today and wished I could
have just said (syntax aside):
URL: cvs;/my.re.po/.cvsroot
Pull: HEAD:remotes/cvs/master
Pull: experiment:remotes/cvs/experiment
to cause "git fetch" to run git-cvsimport to update the remotes/cvs/
branches (and "git pull" to merge CVS changes to my branches).
The same thing should be possible for SVN and other foreign SCM
backends.
Also it should be possible to use git-cvsexportcommit as a
backend for "git push" into the cvs repository.
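One way to picture this is a table keyed on the URL scheme that
tells "git fetch" and "git push" which helper to run. The sketch
below is only that, a picture: struct backend and find_backend()
are invented, the "cvs:" prefix is an assumption about a syntax
that is explicitly left open above, and the "git" row is there
purely for contrast.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mapping from URL scheme to fetch/push helpers. */
struct backend {
    const char *scheme;     /* the part of the URL before ':' */
    const char *fetch_cmd;  /* helper to run for "git fetch" */
    const char *push_cmd;   /* helper to run for "git push" */
};

/* Illustrative table only; the scheme names are assumptions. */
static const struct backend backends[] = {
    { "cvs", "git-cvsimport", "git-cvsexportcommit" },
    { "git", "git-fetch-pack", "git-send-pack" },
};

/* Return the backend for a URL, or NULL if no scheme matches. */
static const struct backend *find_backend(const char *url)
{
    const char *colon = strchr(url, ':');
    size_t i, len;

    if (!colon)
        return NULL;
    len = (size_t)(colon - url);
    for (i = 0; i < sizeof(backends) / sizeof(backends[0]); i++) {
        const char *s = backends[i].scheme;
        if (strlen(s) == len && !strncmp(s, url, len))
            return &backends[i];
    }
    return NULL;
}
```

Supporting another foreign SCM would then mean adding a row to the
table rather than teaching users a separate import command.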
That's it for tonight...