Re: newbie questions about git design and features (some wrt hg)

Previous thread: Qt git repository report by Andy Parkins on Wednesday, January 31, 2007 - 2:12 am. (4 messages)

Next thread: Re: Qt git repository report by Jakub Narebski on Wednesday, January 31, 2007 - 3:57 am. (1 message)
From: Jakub Narebski
Date: Wednesday, January 31, 2007 - 3:56 am

From (libc.info):

 -- Macro: int O_APPEND
     The bit that enables append mode for the file.  If set, then all
     `write' operations write the data at the end of the file, extending
     it, regardless of the current file position.  This is the only
     reliable way to append to a file.  In append mode, you are
     guaranteed that the data you write will always go to the current
     end of the file, regardless of other processes writing to the
     file.  Conversely, if you simply set the file position to the end
     of file and write, then another process can extend the file after
     you set the file position but before you write, resulting in your
     data appearing someplace before the real end of file.
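
To make the quoted guarantee concrete, here is a minimal Python sketch (not taken from libc, git, or Mercurial). The helper names and file names are made up for illustration, and the two "writers" are interleaved deterministically inside one process to mimic the race the manual describes.

```python
import os
import tempfile


def write_with_append(path, chunks):
    """Open one descriptor per chunk with O_APPEND, then write each chunk."""
    fds = [os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
           for _ in chunks]
    try:
        for fd, chunk in zip(fds, chunks):
            os.write(fd, chunk)          # always lands at the current end of file
    finally:
        for fd in fds:
            os.close(fd)


def write_with_seek(path, chunks):
    """Seek every descriptor to the end first, then write (the racy variant)."""
    fds = [os.open(path, os.O_WRONLY | os.O_CREAT, 0o644) for _ in chunks]
    try:
        for fd in fds:
            os.lseek(fd, 0, os.SEEK_END)  # every writer sees the same "end"
        for fd, chunk in zip(fds, chunks):
            os.write(fd, chunk)           # later writers overwrite earlier ones
    finally:
        for fd in fds:
            os.close(fd)


def demo():
    """Return the file contents produced by the two strategies."""
    with tempfile.TemporaryDirectory() as d:
        appended_path = os.path.join(d, "append.log")
        seeked_path = os.path.join(d, "seek.log")
        write_with_append(appended_path, [b"first\n", b"second\n"])
        write_with_seek(seeked_path, [b"first\n", b"second\n"])
        with open(appended_path, "rb") as f:
            appended = f.read()
        with open(seeked_path, "rb") as f:
            seeked = f.read()
        return appended, seeked
```

With O_APPEND both records survive; with seek-then-write the second record lands on top of the first. Note that this only protects individual writes and says nothing about multi-file atomicity, which is the question raised below.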

I don't quite understand how that would help hg (Mercurial) to have
operations like commit, pull/fetch or push atomic, i.e. all or nothing.
In hg you have to update individual files (blobs buckets) storing delta
and perhaps full version, update manifest file (flat tree) and update
changelog (commit): what happens if for example there are two concurrent
operations trying to update repository, e.g. two push operations in parallel
(from two different developers), or fetch from cron and commit? What
happens if operation is interrupted (e.g. lost connection to network during
fetch)?

In git both situations result in some prune-able and fsck-visible crud in
repository, but repository stays uncorrupted, and all operations are atomic
(all or nothing).
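
One common way to get this kind of all-or-nothing behaviour is to write the data first and publish a reference to it last, with a single atomic rename. The following Python sketch illustrates that idea only; the directory layout, the ".lock" suffix and the function names are simplifications chosen for this example, not git's actual code.

```python
import hashlib
import os
import tempfile


def write_object(repo, data):
    """Store data under its SHA-1 name. Nothing refers to it yet."""
    oid = hashlib.sha1(data).hexdigest()
    objdir = os.path.join(repo, "objects")
    os.makedirs(objdir, exist_ok=True)
    path = os.path.join(objdir, oid)
    if not os.path.exists(path):
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)            # object appears fully written or not at all
    return oid


def update_ref(repo, name, oid):
    """Publish a new tip: exclusive lock file, then atomic rename over the ref."""
    ref = os.path.join(repo, name)
    lock = ref + ".lock"
    # O_EXCL makes a second concurrent writer fail instead of interleaving.
    fd = os.open(lock, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    with os.fdopen(fd, "w") as f:
        f.write(oid + "\n")
    os.replace(lock, ref)


def read_ref(repo, name):
    with open(os.path.join(repo, name)) as f:
        return f.read().strip()


def demo():
    with tempfile.TemporaryDirectory() as repo:
        old = write_object(repo, b"version 1\n")
        update_ref(repo, "master", old)
        # Simulated interruption: the object is written, but we "crash"
        # before the reference is updated. The object is harmless crud.
        orphan = write_object(repo, b"version 2\n")
        tip_after_crash = read_ref(repo, "master")
        # A completed operation publishes the new object in one step.
        update_ref(repo, "master", orphan)
        return old, orphan, tip_after_crash, read_ref(repo, "master")
```

In this model an interrupted operation leaves an unreferenced object behind, which is exactly the prune-able crud mentioned above, while readers always see either the old tip or the new one.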
-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git


-

From: Junio C Hamano
Date: Wednesday, January 31, 2007 - 1:01 pm

If I remember correctly, thanks to their log-like file format,
they can rely on O_APPEND to do the right thing when growing,
and aborting the current transaction is just a truncate away (or
a set of truncates on the files appended in the transaction, if
hg touches more than one log-like file but I do not know if hg
uses only one file or more than one).  That's one of the things
I found clean and beautiful (from theoretical point of view, at
least) in their design.  I do not think O_APPEND is used to
control concurrent operations.
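
The append-then-truncate idea described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name, the in-memory journal and the file names are hypothetical and do not reflect Mercurial's real on-disk format.

```python
import os
import tempfile


class AppendTransaction:
    """Remember each file's size before appending, so abort is a truncate."""

    def __init__(self):
        self.journal = {}  # path -> size before the first append in this transaction

    def append(self, path, data):
        if path not in self.journal:
            size = os.path.getsize(path) if os.path.exists(path) else 0
            self.journal[path] = size
        with open(path, "ab") as f:
            f.write(data)

    def commit(self):
        # Nothing to undo any more: forget the recorded sizes.
        self.journal.clear()

    def rollback(self):
        # Cut every touched file back to the length it had before.
        for path, size in self.journal.items():
            with open(path, "r+b") as f:
                f.truncate(size)
        self.journal.clear()


def demo():
    with tempfile.TemporaryDirectory() as d:
        changelog = os.path.join(d, "changelog")
        manifest = os.path.join(d, "manifest")
        with open(changelog, "wb") as f:
            f.write(b"commit 0\n")
        txn = AppendTransaction()
        txn.append(changelog, b"commit 1\n")
        txn.append(manifest, b"manifest 1\n")
        txn.rollback()
        with open(changelog, "rb") as f:
            after = f.read()
        return after, os.path.getsize(manifest)
```

A real implementation would have to persist the journal before appending, so that a later process can finish the rollback after a crash.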

-

From: Matt Mackall
Date: Wednesday, January 31, 2007 - 3:25 pm

Mercurial has write-side locks so there can only ever be one writer at
a time. There are no locks needed on the read side, so there can be

We keep a simple transaction journal. As Mercurial revlogs are
append-only, rolling back a transaction just means truncating all

If a Mercurial transaction is interrupted and not rolled back, the
result is prune-able and fsck-visible crud. But this doesn't happen
much in practice.

The claim that's been made is that a) truncate is unsafe because Linux
has historically had problems in this area and b) git is safer because
it doesn't do this sort of thing. 

My response is a) those problems are overstated and Linux has never
had difficulty with the sorts of straightforward single writer
operations Mercurial uses and b) normal git usage involves regular
rewrites of data with packing operations that makes its exposure to
filesystem bugs equivalent or greater.

In either case, both provide strong integrity checks with recursive
SHA1 hashing, zlib CRCs, and GPG signatures (as well as distributed
"back-up"!) so this is largely a non-issue relative to traditional
systems.

-- 
Mathematics is the supreme nostalgia of our time.
-

From: Jakub Narebski
Date: Wednesday, January 31, 2007 - 4:58 pm

Thanks a lot for complete answer. So Mercurial uses write-side locks
for dealing with concurrent operations, and transaction journal for
dealing with interrupted operations. I guess that incomplete transactions
are rolled back on next hg command...

I guess (please correct me if I'm wrong) that git uses "put reference
after putting data" scheme, and write-side lock in few places when it

Rewrites in git perhaps are (or should be) regular, but need not be frequent.
And with the new idea/feature of kept packs, a rewrite need not cover the full data.

One command which _is_ (a bit) unsafe in git is git-prune. I'm not sure
if it could be made safe. But skipping prune only slightly affects
repository size (where I think git is the best of all SCMs), not performance.

On the other hand, I think the hg repository structure (namely the log-like,
append-only changelog / revlog used to store commits) makes it hard to have
multiple persistent branches.

Sidenote 1: it looks like git is optimized for speed of merge and checkout
(branch switching, or going to given point in history for bisect), and
probably accidentally for multi-branch repos, while Mercurial is optimized
for speed of commit and patch.

Sidenote 2: Mercurial repository structure might make it use "file-ids"
(perhaps implicitly), with all the disadvantages (different renames

Integrity checks can tell you that repository is corrupted, but it would
be better if it didn't get corrupted in first place.

Besides: zlib CRC for Mercurial? I thought that hg didn't compress the
data, and only stored it as delta chains?
-- 
Jakub Narebski
Poland
-

From: Matt Mackall
Date: Wednesday, January 31, 2007 - 5:34 pm

They are either automatically rolled back on abort or if that fails
for some reason like power failure the user is prompted to run "hg
recover" to complete the rollback. We also save the last transaction

Mercurial also uses a "put reference after putting data" which is what

If the set of files in a given commit (say tip) gets spread out across
an arbitrary number of packs ordered by last modification time,

Not sure why you think that. There are some difficulties here, but
they're mostly owing to the fact that we've always emphasized the one



Obviously. Hence our append-only design. Data that's written to a repo
is never rewritten, which minimizes exposure to software bugs and I/O

We use zlib compression of deltas and have since April 6, 2005.

-- 
Mathematics is the supreme nostalgia of our time.
-

From: Jakub Narebski
Date: Wednesday, January 31, 2007 - 5:57 pm

Hierarchical tree objects in git optimize for speed of merge and checkout
IMVHO, as you only need to check one hash to know if you have to
descend into a subdirectory, or if the given subdirectory hasn't changed.
Flat manifest file in Mercurial (and also "filename buckets") makes

How is that so, if the blobs (file contents) are stored hashed by filename?
IIRC hg has some scheme to deal with renames, but it is file-id (file
identity) based AFAIK.

-- 
Jakub Narebski
Poland
-

From: Simon 'corecode' Schubert
Date: Thursday, February 1, 2007 - 12:59 am

No, the buckets are simply the filename.  If you rename, you take the
penalty of duplicating the content (compressed) with a new name.  No big
deal there.  So there are *no* file-ids.  Blobs go into the data/index
file which corresponds to their filename.

cheers
  simon

-- 
Serve - BSD     +++  RENT this banner advert  +++    ASCII Ribbon   /"\
Work - Mac      +++  space for low €€€ NOW!1  +++      Campaign     \ /
Party Enjoy Relax   |   http://dragonflybsd.org      Against  HTML   \
Dude 2c 2 the max   !   http://golden-apple.biz       Mail + News   / \

From: Johannes Schindelin
Date: Thursday, February 1, 2007 - 3:09 am

Hi,

[culled many people from the Cc: list to avoid a flamewar]


So, can you explain to me how a filename is _not_ a file-id?

Ciao,
Dscho

-

From: Simon 'corecode' Schubert
Date: Thursday, February 1, 2007 - 3:15 am

It is not a file-id in the sense other SCMs use it (I think monotone, not
sure though).  If you copy/move the content to a new name, the ID will not
stay the same.  Just see it as a hash bucket which allows you easy access
to the history for a file currently with this name.
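
The "filename as hash bucket" model described here can be sketched as follows. This is a toy Python model built only from the description in this thread; the class and method names are hypothetical and it does not claim to mirror Mercurial's real storage.

```python
from collections import defaultdict


class FilelogStore:
    """Toy store where each filename owns one bucket of revisions."""

    def __init__(self):
        self.buckets = defaultdict(list)  # filename -> contents, oldest first

    def commit(self, name, content):
        self.buckets[name].append(content)

    def rename(self, old, new):
        # The content is duplicated under the new name; the new bucket
        # starts its own history and the old bucket keeps what it had.
        self.buckets[new].append(self.buckets[old][-1])

    def history(self, name):
        return list(self.buckets.get(name, []))


def demo():
    store = FilelogStore()
    store.commit("a.txt", b"v1")
    store.commit("a.txt", b"v2")
    store.rename("a.txt", "b.txt")
    return store.history("a.txt"), store.history("b.txt")
```

Looking up "b.txt" gives only the revisions stored under that name, which is the sense in which the bucket is not an identity that follows the content around.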

cheers
  simon


From: Johannes Schindelin
Date: Thursday, February 1, 2007 - 3:49 am

Hi,


Ah, thanks. I misunderstood the meaning of file-id in _that_ context.

Ciao,
Dscho

-

From: Linus Torvalds
Date: Thursday, February 1, 2007 - 9:28 am

Well, that's actually just another "file ID" too. It's just not an "inode 
number" kind of file ID, it's more the "CVS file ID" kind of ID.

SVN uses "inode numbers" (I think they are just UUID's generated at "svn 
add" time, but I'm not sure) to track file ID's across renames. Some other 
SCM's do the same.

CVS uses "pathname" as the file ID (which obviously doesn't need any 
separate generation at all), which is why you have to do horrible things 
to track file ID's across renames (ie you really can't, but you *can* copy 
or move the *,v file so that your *new* "file ID" also has the same 
history as your old one).

So both of those are "file ID's" - they are what is used to index into the 
history, and they have real meaning for very fundamental operations.

You can view git as "closer" to CVS, in the sense that it certainly 
doesn't have the SVN kind of location-independent ID, and it _is_ able to 
look back in history using the path-name. So in that sense, you can 
certainly claim that the pathname is the "file ID" in git too, and that 
git is closer to CVS than to SVN.

But unlike SVN or CVS, there is no real fundamental "meaning" to the 
pathname in git. Sure, you can use the pathname to trace history of a 
file, but on the other hand, you can use a random aggregation of pathnames 
to track history of a set of files and directories, and the pathnames 
actually exist even when the file doesn't. So there obviously isn't any 
1:1 relationship, neither in usage, nor in any internal implementation.

So at least for me, "file ID" means "identifier for a particular chain of 
history". THAT exists in both CVS and SVN (it's a pathname and an "inode 
number" respectively), but does not exist in git at all.

			Linus
-

From: Eric Wong
Date: Thursday, February 1, 2007 - 12:36 pm

I think you got this part confused with GNU Arch (and possibly
Bzr).  SVN tracks renames in the changeset, it records (in the log)
a copy and delete.  pathname@revision is the only "file ID" I know
about in SVN.

-- 
Eric Wong
-

From: Linus Torvalds
Date: Thursday, February 1, 2007 - 2:13 pm

Ahh, I was sure the revision files in FSFS were per-file, but color me 
corrected - they seem to be per-revision.

My bad.

		Linus
-

From: Jakub Narebski
Date: Friday, February 2, 2007 - 2:55 am

Well, perhaps I should say that append-log changelog / revlog[*1*] structure
to store commits makes it natural to have one branch per repository, as
branch (in the lineage of given commit meaning, i.e. all commits which
are ancestors of given commit) is roughly equivalent to changelog / revlog
and branch tip (latest commit on a branch) is top commit (latest entry)
in changelog / revlog.

In git, the DAG (directed acyclic graph) of commits, with a branch tip being
a moving pointer (moving like a top-of-stack pointer) to a commit in the DAG,
makes it natural to have multiple branches in a repository (the current branch
is the branch pointed to by HEAD, another pointer - to a branch this time[*2*]).
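
The picture of branches as moving pointers into a commit DAG can be written down as a tiny Python model. This is a sketch for illustration only; the class, the commit ids ("c0", "c1", ...) and the method names are made up and are not git's data structures.

```python
class Repo:
    """Toy model: commits form a DAG, branches are names pointing into it."""

    def __init__(self):
        self.commits = {}       # commit id -> list of parent ids
        self.branches = {}      # branch name -> commit id of the tip
        self.head = "master"    # HEAD points to a branch, not to a commit
        self._next = 0

    def commit(self):
        cid = "c%d" % self._next
        self._next += 1
        tip = self.branches.get(self.head)
        self.commits[cid] = [tip] if tip else []
        self.branches[self.head] = cid   # only the current branch pointer moves
        return cid

    def branch(self, name):
        # A new branch is just another pointer to the current tip.
        self.branches[name] = self.branches[self.head]

    def checkout(self, name):
        if name not in self.branches:
            raise KeyError(name)
        self.head = name

    def ancestors(self, name):
        """All commits reachable from the tip of the given branch."""
        seen, stack = set(), [self.branches[name]]
        while stack:
            cid = stack.pop()
            if cid not in seen:
                seen.add(cid)
                stack.extend(self.commits[cid])
        return seen


def demo():
    repo = Repo()
    repo.commit()            # c0 on master
    repo.commit()            # c1 on master
    repo.branch("topic")
    repo.checkout("topic")
    repo.commit()            # c2 on topic
    repo.checkout("master")
    repo.commit()            # c3 on master
    return repo.branches, repo.ancestors("master"), repo.ancestors("topic")
```

Creating a branch costs one dictionary entry, and a branch in the "lineage" sense is simply everything reachable from its tip.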

Perhaps a multiple-branch repository makes the learning curve a bit steeper,
but it also encourages using temporary branches and topic branches, which
makes _development_ (as opposed to using the version control tool) more
(power)user-friendly, and makes the SCM more powerful.


How does Mercurial solve the problem of multiple _persistent_ branches? Does
it add pointers to commits somewhere deeper in changelog / revlog?


By the way, RCS / CVS rewrote relevant data (to have diff from the top
structure) on each commit.


Nice to know. You compress only file deltas, or also file revision
metadata? Do you compress manifests (trees) and commits (or at least
commit messages) too?

Footnotes:
----------

[*1*] I don't know what nomenclature Mercurial uses for blobs (file
contents), trees (directory contents) and commits (revision contents)
storage.

[*2*] I disregard here latest work on "detached HEAD" in git.

-- 
Jakub Narebski
Poland
-

From: Simon 'corecode' Schubert
Date: Friday, February 2, 2007 - 6:51 am

What do you mean with that?  generate the pack on which occasion?  CVS
import?  I do this already.

cheers
  simon


From: Jakub Narebski
Date: Friday, February 2, 2007 - 7:23 am

On commit.
-- 
Jakub Narebski
Poland
-

From: Shawn O. Pearce
Date: Friday, February 2, 2007 - 8:02 am

I've thought about doing this.  Except there are three independent
processes occurring during commit that generate objects:

	update-index
	write-tree
	commit-tree

and the update-index portion is also git-add, which we have now
started to encourage users to do ahead of time as often as needed,
prior to running git-commit.  It's also the one that generates the
largest set of new objects for most projects.

One problem is that we have a rule: "don't delta an object
which is already in a pack, unless -f is given".  This is one of
the reasons `git repack -a -d -l` is so dang fast.  It's assuming
all new stuff is loose, and therefore should be delta'd, but the
old stuff which we have already delta'd is kept as-is.


Basically I've thought about doing this (after my work in gfi)
and decided it's not worth the level of effort involved at this time.
So I'm not going to do it.  Someone else can try.  ;-)

-- 
Shawn.
-

From: Matt Mackall
Date: Friday, February 2, 2007 - 9:03 am

Each changeset may have a branch marker.

Here's branches in use with an import of mutt's CVS history:

$ hg branches
mutt-0-94                      208:b2cc0abd8fe0
HEAD                           207:a505693b54c1
mutt-0-93                      134:d59345944030
muttintl                       1:29510de8b3fc
$ hg co HEAD
176 files updated, 0 files merged, 8 files removed, 0 files unresolved
$ hg branch
HEAD
$ hg branch devel
$ hg branch
devel


All three use the same underlying storage format, so yes.

-- 
Mathematics is the supreme nostalgia of our time.
-

From: Jakub Narebski
Date: Friday, February 2, 2007 - 10:18 am

By changeset you mean commit-revlog (changelog)? 

Where are those branch markers stored? Are those markers moving pointers,
meaning that if you make a commit while on a branch, the branch marker for
the current branch will move?

Static markers cannot identify branch in the presence of branch points:

                   ---a<---b ........ side branch
                  /
  1<---2<---3<---4<---5<---6<---7 ... main branch
            ^
            :   

What is the first number? I understand that second is shortened (is it

Git (at least for now) writes nothing on checkout; it is planned that
it would write changes status-like; perhaps summary would be enough...

Revision-controlled (in-tree) tags are an inane idea. Tags are non-moving
(and sometimes annotated) pointers to a given point in history. They should
not depend on which branch you are on, or what version you have checked out.

Otherwise the following would not work:
 $ git reset --hard v1.0.0
 $ git reset --hard v1.4.4.4
(it could be "git checkout" instead of "git reset --hard" in 'master'

But do you compress metadata (like base of a delta for file deltas,
authorship of a commit and reference to manifest-log entry)? Is the
manifest delta-encoded?

-- 
Jakub Narebski
Poland
-

From: Matt Mackall
Date: Friday, February 2, 2007 - 10:37 am

And.. they don't!

I'm now officially done correcting your uninformed perceptions. Come
back when you've actually looked at the docs.

-- 
Mathematics is the supreme nostalgia of our time.
-

From: Jakub Narebski
Date: Friday, February 2, 2007 - 11:44 am

If that means that you always use the version of .hgtags from the tip
(branches are tips of history; they can have different .hgtags),
this is also broken; this means for example that you cannot compare
current version when on development head (branch) with tag on different
branch, those two branches have the same .hgtags file.


URL, pretty please?

My mistake is caused by the fact that .hgtags is special, i.e. it is not
the current version that is used (as e.g. with .scmignore files) but the
version closest to the tip. This means a broken abstraction.

-- 
Jakub Narebski
Poland
-

From: Jakub Narebski
Date: Friday, February 2, 2007 - 12:56 pm

I meant to write:


For example you are on branch 'master', you tag current release
e.g. v1.3.4, then you checkout branch 'devel'... and you don't have
v1.3.4 tag available unless you merge in .hgtags from 'master'.
At least from what I understand of Mercurial tags behaviour.

Having to create a commit to remember a tag which can be published...
I'm not sure if it is a good idea either. Junio creates "GIT 1.4.4.3"
commits, and those are tagged, so perhaps it is not such a bad idea
after all.

You encourage hand-editing .hgtags, but the edited version might
not be the one that is used (for example when starting a branch).

-- 
Jakub Narebski
Poland
-

From: Brendan Cully
Date: Saturday, February 3, 2007 - 1:06 pm

This would be bad, if it were true.

$ hg up devel
2 files updated, 0 files merged, 0 files removed, 0 files unresolved
$ cat .hgtags
6acda9aa5d8c621b3db2f2daab878d8de726d227 base
$ hg tags
tip                                4:b1f003583d8e
v1.3.4                             2:87e43e86318f
base                               0:6acda9aa5d8c

As mentioned before, hg has local tags which sound an awful lot like
git tags. It also has properly versioned tags. And, by the way, if you
push a branch, you only push the tags that were committed on that
branch. Furthermore, you can push based on a tag name that isn't
committed in the branch you're pushing. I think the "globally global"
nonsense elsewhere in this thread may be a result of not understanding
this.
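
One way to read the output above is that tags are collected from the .hgtags files of every head, not just from the checked-out revision. The Python sketch below models that reading only; it is an assumption consistent with the transcript, not a description of Mercurial's code. The function names are hypothetical, and the full hash for v1.3.4 is a padded placeholder because only its short form appears above.

```python
BASE = "6acda9aa5d8c621b3db2f2daab878d8de726d227"  # from the .hgtags shown above
V134 = "87e43e86318f" + "0" * 28                   # placeholder: full hash not shown


def parse_hgtags(text):
    """Parse '<node> <name>' lines; later lines override earlier ones."""
    tags = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        node, name = line.split(" ", 1)
        tags[name] = node
    return tags


def visible_tags(hgtags_per_head):
    """Merge the tags found in the .hgtags of each head, oldest head first."""
    merged = {}
    for text in hgtags_per_head:
        merged.update(parse_hgtags(text))
    return merged


def demo():
    master_hgtags = BASE + " base\n" + V134 + " v1.3.4\n"
    devel_hgtags = BASE + " base\n"   # what 'cat .hgtags' showed on devel
    return visible_tags([master_hgtags, devel_hgtags])
```

Under this model the working copy on 'devel' still lists v1.3.4, even though its own .hgtags only mentions base.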

I'm probably done with this thread too. There's too much ignorant
speculation to make it very productive.
-

From: Jakub Narebski
Date: Saturday, February 3, 2007 - 1:55 pm

The above sequence of commands is not enough to reproduce the situation
I want to talk about, namely situation (repository structure) as in 
below:

                    /-\ 
   1---a---2---3---T---t---b   .... 'master' branch
        \ 
         \-2'--3'--c           .... 'devel' branch

where 'a' is branching point (merge base) of 'master' and 'devel' 
branches, 'T' is tagged changeset (revision, commit), 't' is commit
where .hgtags with 'T' tag was committed. Changesets (revisions)
'b' and 'c' are tips of 'master' and 'devel' branch, respectively.

If .hgtags was an ordinary file, then at revision marked in above
graph as '2' it wouldn't have tag 'T'.  Documentation (Mercurial
HOWTO to be more exact) tells that hg uses .hgtags version from the
tip.  But when we are at branch 'devel', the version from the tip
is version 'c' without 'T', not version 'b' with 'T'... if .hgtags
would behave as described in documentation.

It looks however (if what you say above is true also for the situation 
as in above graph, i.e. when at 'devel' branch we have 'T' in .hgtags)
that Mercurial always uses _latest_ version of .hgtags file (as in 
external wall time, having nothing to do with the history as 
represented in repository). But then we cannot say that we can merge
.hgtags file, so it is probably not the case. It is also contrary to 
what I gathered from documentation.

If above was true, i.e. .hgtags doesn't behave at all as normal file in 
working area, then what the heck it is doing there, and not somewhere 

Git tags can be propagated. hg local tags cannot be propagated. hg tags 

Reusing in-tree version control to version tags is IMVHO not a good 


-- 
Jakub Narebski
Poland
-

From: Jakub Narebski
Date: Saturday, February 3, 2007 - 2:00 pm

Jakub Narebski wrote:

I meant: not somewhere under .hg/ (in repository, and not in working area,
if it does not behave as an ordinary working area file)
-- 
Jakub Narebski
Poland
-
