From (libc.info):
-- Macro: int O_APPEND
The bit that enables append mode for the file. If set, then all
`write' operations write the data at the end of the file, extending
it, regardless of the current file position. This is the only
reliable way to append to a file. In append mode, you are
guaranteed that the data you write will always go to the current
end of the file, regardless of other processes writing to the
file. Conversely, if you simply set the file position to the end
of file and write, then another process can extend the file after
you set the file position but before you write, resulting in your
data appearing someplace before the real end of file.
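As a minimal sketch of the behaviour the manual describes (the helper names `append_record` and `demo` are made up for illustration), the following Python snippet opens a file with `O_APPEND`, deliberately moves the file position back to the start, and still ends up writing at the end of the file:

```python
import os
import tempfile


def append_record(path: str, data: bytes) -> None:
    """Append data to path; O_APPEND makes each write go to the end of the file."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        # Moving the file position does not change where the write lands:
        # in append mode every write goes to the current end of the file.
        os.lseek(fd, 0, os.SEEK_SET)
        os.write(fd, data)
    finally:
        os.close(fd)


def demo() -> bytes:
    """Append two records to a scratch file and return the file contents."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "log")
        append_record(path, b"first\n")
        append_record(path, b"second\n")
        with open(path, "rb") as f:
            return f.read()
```

This only illustrates the append guarantee for individual writes; it says nothing about making a multi-file operation atomic, which is the question raised in the thread.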
I don't quite understand how that would help hg (Mercurial) to make
operations like commit, pull/fetch or push atomic, i.e. all or nothing.
In hg you have to update individual files (blob buckets) storing deltas
and perhaps the full version, update the manifest file (flat tree) and update
the changelog (commit): what happens if, for example, there are two concurrent
operations trying to update the repository, e.g. two push operations in parallel
(from two different developers), or a fetch from cron and a commit? What
happens if an operation is interrupted (e.g. a lost network connection during
fetch)?
In git both situations result in some prune-able and fsck-visible crud in
repository, but repository stays uncorrupted, and all operations are atomic
(all or nothing).
--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
-
If I remember correctly, thanks to their log-like file format, they can rely on O_APPEND to do the right thing when growing, and aborting the current transaction is just a truncate away (or a set of truncates on the files appended to in the transaction, if hg touches more than one log-like file; I do not know whether hg uses only one file or more than one). That's one of the things I found clean and beautiful (from a theoretical point of view, at least) in their design. I do not think O_APPEND is used to control concurrent operations. -
Mercurial has write-side locks so there can only ever be one writer at a time. There are no locks needed on the read side, so there can be We keep a simple transaction journal. As Mercurial revlogs are append-only, rolling back a transaction just means truncating all If a Mercurial transaction is interrupted and not rolled back, the result is prune-able and fsck-visible crud. But this doesn't happen much in practice. The claim that's been made is that a) truncate is unsafe because Linux has historically had problems in this area and b) git is safer because it doesn't do this sort of thing. My response is a) those problems are overstated and Linux has never had difficulty with the sorts of straightforward single writer operations Mercurial uses and b) normal git usage involves regular rewrites of data with packing operations that makes its exposure to filesystem bugs equivalent or greater. In either case, both provide strong integrity checks with recursive SHA1 hashing, zlib CRCs, and GPG signatures (as well as distributed "back-up"!) so this is largely a non-issue relative to traditional systems. -- Mathematics is the supreme nostalgia of our time. -
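The journal-plus-truncate idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Mercurial's actual code: the `Journal` class, its tab-separated on-disk format, and the `demo` helper are all hypothetical.

```python
import os
import tempfile


class Journal:
    """Toy transaction journal for append-only files (hypothetical format)."""

    def __init__(self, journal_path):
        self.journal_path = journal_path

    def add(self, path):
        """Record the current length of path before anything is appended to it."""
        size = os.path.getsize(path) if os.path.exists(path) else 0
        with open(self.journal_path, "a") as j:
            j.write(f"{size}\t{path}\n")
            j.flush()
            os.fsync(j.fileno())

    def commit(self):
        """The transaction succeeded: the journal is no longer needed."""
        os.remove(self.journal_path)

    def rollback(self):
        """Truncate every journaled file back to its first recorded length."""
        sizes = {}
        with open(self.journal_path) as j:
            for line in j:
                size, path = line.rstrip("\n").split("\t", 1)
                sizes.setdefault(path, int(size))  # keep the earliest size
        for path, size in sizes.items():
            if os.path.exists(path):
                os.truncate(path, size)
        os.remove(self.journal_path)


def demo() -> bytes:
    """Append to a log inside a transaction, abort it, and return the log."""
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "changelog")
        with open(log, "ab") as f:
            f.write(b"rev0\n")
        journal = Journal(os.path.join(tmp, "journal"))
        journal.add(log)
        with open(log, "ab") as f:
            f.write(b"rev1 (interrupted)\n")
        journal.rollback()
        with open(log, "rb") as f:
            return f.read()
```

The sketch relies on the single-writer assumption mentioned in the message: with only one writer at a time, the recorded lengths cannot be invalidated by someone else appending in between.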
Thanks a lot for the complete answer. So Mercurial uses write-side locks for dealing with concurrent operations, and a transaction journal for dealing with interrupted operations. I guess that incomplete transactions are rolled back on the next hg command...
I guess (please correct me if I'm wrong) that git uses a "put reference after putting data" scheme, and a write-side lock in a few places when it
Rewrites in git perhaps are (or should be) regular, but need not be frequent. And with the new idea/feature of kept packs, a rewrite need not be of the full data.
One command which _is_ (a bit) unsafe in git is git-prune. I'm not sure if it could be made safe. But not doing prune affects only the repository size a bit (where git is, I think, the best of all SCMs) and not performance.
On the other hand, the hg repository structure (namely the log-like append-only changelog / revlog used to store commits) makes it, I think, hard to have multiple persistent branches.
Sidenote 1: it looks like git is optimized for speed of merge and checkout (branch switching, or going to a given point in history for bisect), and probably accidentally for multi-branch repos, while Mercurial is optimized for speed of commit and patch.
Sidenote 2: Mercurial repository structure might make it use "file-ids" (perhaps implicitly), with all the disadvantages (different renames
Integrity checks can tell you that the repository is corrupted, but it would be better if it didn't get corrupted in the first place.
Besides: zlib CRC for Mercurial? I thought that hg didn't compress the data, and only stored it as delta chains?
--
Jakub Narebski
Poland
-
They are either automatically rolled back on abort or if that fails for some reason like power failure the user is prompted to run "hg recover" to complete the rollback. We also save the last transaction Mercurial also uses a "put reference after putting data" which is what If the set of files in a given commit (say tip) gets spread out across an arbitrary number of packs ordered by last modification time, Not sure why you think that. There are some difficulties here, but they're mostly owing to the fact that we've always emphasized the one Obviously. Hence our append-only design. Data that's written to a repo is never rewritten, which minimizes exposure to software bugs and I/O We use zlib compression of deltas and have since April 6, 2005. -- Mathematics is the supreme nostalgia of our time. -
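The "put reference after putting data" ordering mentioned in this exchange can be sketched as follows. This is a simplified illustration, not the on-disk format of git or Mercurial: the `objects/` and `refs/` layout, the plain SHA-1 of the raw content, and the helper names `write_atomically`, `store_then_point` and `demo` are all assumptions made for the example.

```python
import hashlib
import os
import tempfile


def write_atomically(path: str, data: bytes) -> None:
    """Write data to a temporary file, then rename it over path in one step."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def store_then_point(repo: str, ref_name: str, content: bytes) -> str:
    """Store content under its hash first, and only then update the reference."""
    digest = hashlib.sha1(content).hexdigest()
    objects = os.path.join(repo, "objects")
    refs = os.path.join(repo, "refs")
    os.makedirs(objects, exist_ok=True)
    os.makedirs(refs, exist_ok=True)
    # Step 1: the data. If we are interrupted after this line, the object is
    # merely unreferenced crud that a prune-like tool could remove later.
    write_atomically(os.path.join(objects, digest), content)
    # Step 2: the reference. Readers see either the old or the new value.
    write_atomically(os.path.join(refs, ref_name), digest.encode() + b"\n")
    return digest


def demo():
    """Return (digest, ref file contents, stored object contents)."""
    with tempfile.TemporaryDirectory() as repo:
        digest = store_then_point(repo, "master", b"hello")
        with open(os.path.join(repo, "refs", "master")) as f:
            ref = f.read().strip()
        with open(os.path.join(repo, "objects", digest), "rb") as f:
            stored = f.read()
        return digest, ref, stored
```

The point of the ordering is that an interruption between the two steps never leaves a reference pointing at missing data.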
Hierarchical tree objects in git optimize for speed of merge and checkout IMVHO, as you need to check only one hash to know whether you have to descend into a subdirectory, or whether the given subdirectory hasn't changed. Flat manifest file in Mercurial (and also "filename buckets") makes
How is that so, if the blobs (file contents) are stored filename hashed? IIRC hg has some scheme to deal with renames, but it is file-id (file identity) based AFAIK.
--
Jakub Narebski
Poland
-
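The "check one hash to decide whether to descend" argument can be illustrated with a small sketch. It models trees as nested Python dicts and recomputes hashes on the fly, whereas a real store would keep each tree's hash alongside the tree; `tree_hash` and `changed_paths` are hypothetical helpers, and the hash input format is an assumption, not git's tree encoding.

```python
import hashlib


def tree_hash(tree: dict) -> str:
    """Hash a nested dict: str values are file contents, dict values are subtrees."""
    h = hashlib.sha1()
    for name in sorted(tree):
        entry = tree[name]
        if isinstance(entry, dict):
            kind, child = "tree", tree_hash(entry)
        else:
            kind, child = "blob", hashlib.sha1(entry.encode()).hexdigest()
        h.update(f"{kind} {name} {child}\n".encode())
    return h.hexdigest()


def changed_paths(old: dict, new: dict, prefix: str = "") -> list:
    """List paths that differ, skipping subtrees whose hashes are equal."""
    changed = []
    for name in sorted(set(old) | set(new)):
        a, b = old.get(name), new.get(name)
        path = prefix + name
        if isinstance(a, dict) and isinstance(b, dict):
            # One hash comparison decides whether to descend at all.
            if tree_hash(a) != tree_hash(b):
                changed.extend(changed_paths(a, b, path + "/"))
        elif a != b:
            changed.append(path)
    return changed


old = {"README": "hi", "src": {"a.c": "1", "b.c": "2"}, "doc": {"x": "y"}}
new = {"README": "hi", "src": {"a.c": "1", "b.c": "3"}, "doc": {"x": "y"}}
```

With these two example trees, the unchanged `doc` subtree is dismissed by a single hash comparison and only `src` is walked.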
No, the buckets are simply the filename. If you rename, you take the
penalty of duplicating the content (compressed) with a new name. No big
deal there. So there are *no* file-ids. Blobs go into the data/index file
which corresponds to their filename.
cheers
simon
--
Serve - BSD +++ RENT this banner advert +++ ASCII Ribbon /"\
Work - Mac +++ space for low €€€ NOW!1 +++ Campaign \ /
Party Enjoy Relax | http://dragonflybsd.org Against HTML \
Dude 2c 2 the max ! http://golden-apple.biz Mail + News / \
Hi, [culled many people from the Cc: list to avoid a flamewar] So, can you explain to me how a filename is _not_ a file-id? Ciao, Dscho -
It is not a file-id like other SCM use it (I think monotone, not sure
though). If you copy/move the content to a new name, the ID will not stay
the same. Just see it as a hash bucket which allows you easy access to
the history for a file currently with this name.
cheers
simon
Hi, Ah, thanks. I misunderstood the meaning of file-id in _that_ context. Ciao, Dscho -
Well, that's actually just another "file ID" too. It's just not an "inode number" kind of file ID, it's more the "CVS file ID" kind of ID. SVN uses "inode numbers" (I think they are just UUID's generated at "svn add" time, but I'm not sure) to track file ID's across renames. Some other SCM's do the same. CVS uses "pathname" as the file ID (which obviously doesn't need any separate generation at all), which is why you have to do horrible things to track file ID's across renames (ie you really can't, but you *can* copy or move the *,v file so that your *new* "file ID" also has the same history as your old one). So both of those are "file ID's" - they are what is used to index into the history, and they have real meaning for very fundamental operations. You can view git as "closer" to CVS, in the sense that it certainly doesn't have the SVN kind of location-independent ID, and it _is_ able to look back in history using the path-name. So in that sense, you can certainly claim that the pathname is the "file ID" in git too, and that git is closer to CVS than to SVN. But unlike SVN or CVS, there is no real fundamental "meaning" to the pathname in git. Sure, you can use the pathname to trace history of a file, but on the other hand, you can use a random aggregation of pathnames to track history of a set of files and directories, and the pathnames actually exist even when the file doesn't. So there obviously isn't any 1:1 relationship, neither in usage, nor in any internal implementation. So at least for me, "file ID" means "identifier for a particular chain of history". THAT exists in both CVS and SVN (it's a pathname and an "inode number" respectively), but does not exist in git at all. Linus -
I think you got this part confused with GNU Arch (and possibly Bzr). SVN tracks renames in the changeset, it records (in the log) a copy and delete. pathname@revision is the only "file ID" I know about in SVN. -- Eric Wong -
Ahh, I was sure the revision files in FSFS were per-file, but color me corrected - they seem to be per-revision. My bad. Linus -
Well, perhaps I should say that the append-log changelog / revlog[*1*] structure used to store commits makes it natural to have one branch per repository, as a branch (in the sense of the lineage of a given commit, i.e. all commits which are ancestors of that commit) is roughly equivalent to the changelog / revlog, and the branch tip (latest commit on a branch) is the top commit (latest entry) in the changelog / revlog.
In git, the DAG (directed acyclic graph) of commits, with the branch tip as a moving pointer (moving like a top-of-stack pointer) to a commit in the DAG, makes it natural to have multiple branches in a repository (the current branch is the branch pointed to by HEAD, another pointer - to a branch this time[*2*]). Perhaps a multiple-branch repository makes the learning curve a bit steeper, but it also encourages using temporary branches and topic branches, which makes _development_ (as opposed to using the version control tool) more (power)user-friendly, and makes the SCM more powerful.
How does Mercurial solve the problem of multiple _persistent_ branches? Does it add pointers to commits somewhere deeper in the changelog / revlog?
By the way, RCS / CVS rewrote the relevant data (to have diff from the top structure) on each commit.
Nice to know. Do you compress only file deltas, or also file revision metadata? Do you compress manifests (trees) and commits (or at least commit messages) too?
Footnotes:
----------
[*1*] I don't know what nomenclature Mercurial uses for blobs (file contents), trees (directory contents) and commits (revision contents) storage.
[*2*] I disregard here the latest work on "detached HEAD" in git.
--
Jakub Narebski
Poland
-
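The "branch tip as a moving pointer into a DAG" model described in this message can be sketched with two dictionaries. This is a toy model, not git's storage: `commits` maps a commit id to its parent ids, `branches` maps a branch name to its tip, and the helper names `commit`, `lineage` and `demo` are made up for the example.

```python
def commit(commits: dict, branches: dict, branch: str, new_id: str) -> None:
    """Add new_id on top of the branch tip, then move the branch pointer to it."""
    tip = branches.get(branch)
    commits[new_id] = [tip] if tip is not None else []
    branches[branch] = new_id


def lineage(commits: dict, tip: str) -> set:
    """Return every commit reachable from tip through parent links."""
    seen = set()
    stack = [tip]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(commits[node])
    return seen


def demo():
    """Build a main branch 1-2-3 and a side branch forking off at commit 2."""
    commits, branches = {}, {}
    for cid in ("1", "2", "3"):
        commit(commits, branches, "main", cid)
    branches["side"] = "2"  # a new branch is just another pointer into the DAG
    commit(commits, branches, "side", "a")
    main = lineage(commits, branches["main"])
    side = lineage(commits, branches["side"])
    return branches, main, side
```

In this model a branch costs one dictionary entry, and "the branch" in the lineage sense is whatever is reachable from its tip, which is why several branches can share the same commit storage.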
What do you mean with that? generate the pack on which occasion? CVS
import? I do this already.
cheers
simon
I've thought about doing this. Except there are three independent processes occurring during commit that generate objects:
    update-index
    write-tree
    commit-tree
and the update-index portion is also git-add, which we have now started to encourage users to do ahead of time as often as needed, prior to running git-commit. It's also the one that generates the largest set of new objects for most projects.
One problem is that we have a rule: "don't delta an object which is already in a pack, unless -f is given". This is one of the reasons `git repack -a -d -l` is so dang fast. It's assuming all new stuff is loose, and therefore should be delta'd, but the old stuff which we have already delta'd is kept as-is.
Basically I've thought about doing this (after my work in gfi) and decided it's not worth the level of effort involved at this time. So I'm not going to do it. Someone else can try. ;-)
--
Shawn.
-
Each changeset may have a branch marker. Here's branches in use with an import of mutt's CVS history:
    $ hg branches
    mutt-0-94    208:b2cc0abd8fe0
    HEAD         207:a505693b54c1
    mutt-0-93    134:d59345944030
    muttintl     1:29510de8b3fc
    $ hg co HEAD
    176 files updated, 0 files merged, 8 files removed, 0 files unresolved
    $ hg branch
    HEAD
    $ hg branch devel
    $ hg branch
    devel
All three use the same underlying storage format, so yes.
--
Mathematics is the supreme nostalgia of our time.
-
By changeset you mean commit-revlog (changelog)?
Where are those branch markers stored? Are those markers moving pointers,
meaning that if you make a commit while on a branch, the branch marker for
the current branch will move?
Static markers cannot identify branch in the presence of branch points:
---a<---b ........ side branch
/
1<---2<---3<---4<---5<---6<---7 ... main branch
^
:
What is the first number? I understand that second is shortened (is it
Git (at least for now) writes nothing on checkout; it is planned that
it would write changes status-like; perhaps summary would be enough...
Revision-controlled (in-tree) tags are an inane idea. Tags are non-moving
(and sometimes annotated) pointers to a given point in history. They should
not depend on which branch you are on, or what version you have checked out.
Otherwise the following would not work:
$ git reset --hard v1.0.0
$ git reset --hard v1.4.4.4
(it could be "git checkout" instead of "git reset --hard" in 'master'
But do you compress metadata (like the base of a delta for file deltas,
authorship of a commit and the reference to the manifest-log entry)? Is the
manifest delta-encoded?
--
Jakub Narebski
Poland
-
And.. they don't! I'm now officially done correcting your uninformed perceptions. Come back when you've actually looked at the docs. -- Mathematics is the supreme nostalgia of our time. -
If that means that you always use the version of .hgtags from the tip (branches are tips of history; they can have different .hgtags), this is also broken; this means for example that you cannot compare current version when on development head (branch) with tag on different branch, those two branches have the same .hgtags file. URL, pretty please? My mistake is caused by the fact that .hgtags is special, i.e. not current version is used (as e.g. with .scmignore files) but version closest to the tip. This means broken abstraction. -- Jakub Narebski Poland -
I meant to write: For example, you are on branch 'master', you tag the current release e.g. v1.3.4, then you check out branch 'devel'... and you don't have the v1.3.4 tag available unless you merge in .hgtags from 'master'. At least from what I understand of Mercurial tags behaviour.
Having to create a commit to remember a tag which can be published... I'm not sure if it is a good idea either. Junio creates "GIT 1.4.4.3" commits, and those are tagged, so perhaps it is not such a bad idea either.
You encourage hand-editing .hgtags, but the edited version might not be the one that is used (for example when starting a branch).
--
Jakub Narebski
Poland
-
This would be bad, if it were true.
    $ hg up devel
    2 files updated, 0 files merged, 0 files removed, 0 files unresolved
    $ cat .hgtags
    6acda9aa5d8c621b3db2f2daab878d8de726d227 base
    $ hg tags
    tip       4:b1f003583d8e
    v1.3.4    2:87e43e86318f
    base      0:6acda9aa5d8c
As mentioned before, hg has local tags which sound an awful lot like git tags. It also has properly versioned tags. And, by the way, if you push a branch, you only push the tags that were committed on that branch. Furthermore, you can push based on a tag name that isn't committed in the branch you're pushing. I think the "globally global" nonsense elsewhere in this thread may be a result of not understanding this.
I'm probably done with this thread too. There's too much ignorant speculation to make it very productive.
-
The above sequence of commands is not enough to reproduce the situation
I want to talk about, namely situation (repository structure) as in
below:
                 /-\
1---a---2---3---T---t---b    .... 'master' branch
     \
      \-2'--3'--c            .... 'devel' branch
where 'a' is branching point (merge base) of 'master' and 'devel'
branches, 'T' is tagged changeset (revision, commit), 't' is commit
where .hgtags with 'T' tag was committed. Changesets (revisions)
'b' and 'c' are tips of 'master' and 'devel' branch, respectively.
If .hgtags was an ordinary file, then at revision marked in above
graph as '2' it wouldn't have tag 'T'. Documentation (Mercurial
HOWTO to be more exact) tells that hg uses .hgtags version from the
tip. But when we are at branch 'devel', the version from the tip
is version 'c' without 'T', not version 'b' with 'T'... if .hgtags
would behave as described in documentation.
It looks however (if what you say above is true also for the situation
as in the above graph, i.e. when at the 'devel' branch we have 'T' in .hgtags)
that Mercurial always uses the _latest_ version of the .hgtags file (as in
external wall time, having nothing to do with the history as
represented in the repository). But then we cannot say that we can merge
the .hgtags file, so it is probably not the case. It is also contrary to
what I gathered from the documentation.
If above was true, i.e. .hgtags doesn't behave at all as normal file in
working area, then what the heck it is doing there, and not somewhere
Git tags can be propagated. hg local tags cannot be propagated. hg tags
Reusing in-tree version control to version tags is IMVHO not a good
--
Jakub Narebski
Poland
-
Jakub Narebski wrote: I meant: not somewhere under .hg/ (in repository, and not in working area, if it does not behave as an ordinary working area file) -- Jakub Narebski Poland -
