In http://thread.gmane.org/gmane.comp.version-control.git/42479, a birdview on the source code was requested. J. Bruce Fields suggested that my reply should be included in the user manual, and there was nothing of an outcry, so here it is, not even 2 months later. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> --- Documentation/user-manual.txt | 196 +++++++++++++++++++++++++++++++++++++++++ 1 files changed, 196 insertions(+), 0 deletions(-) diff --git a/Documentation/user-manual.txt b/Documentation/user-manual.txt index 67f5b9b..3c67b68 100644 --- a/Documentation/user-manual.txt +++ b/Documentation/user-manual.txt @@ -3161,6 +3161,202 @@ confusing and scary messages, but it won't actually do anything bad. In contrast, running "git prune" while somebody is actively changing the repository is a *BAD* idea). +[[birdview-on-the-source-code]] +A birdview on Git's source code +----------------------------- + +While Git's source code is quite elegant, it is not always easy for +new developers to find their way through it. A good idea is to look +at the contents of the initial commit: +_e83c5163316f89bfbde7d9ab23ca2e25604af290_ (also known as _v0.99~954_). + +Tip: you can see what files are in there with + +---------------------------------------------------- +$ git show e83c5163316f89bfbde7d9ab23ca2e25604af290: +---------------------------------------------------- + +and look at those files with something like + +----------------------------------------------------------- +$ git show e83c5163316f89bfbde7d9ab23ca2e25604af290:cache.h +----------------------------------------------------------- + +Be sure to read the README in that revision _after_ you are familiar with +the terminology (<<glossary>>), since the terminology has changed a little +since then. For example, we call the things "commits" now, which are +described in that README as "changesets". + +Actually a lot of the structure as it is now can be explained by that ...
Either it should be "unsigned char[40]" (or possibly 41 with a terminating \0), or else you shouldn't be talking about hexadecimal since it's just a 20-byte big-endian unsigned integer. (A third possibility is that I'm totally confused.) -- Karl Hasselstr
Hi, On Tue, 8 May 2007, Karl Hasselstr
On 2007-05-08 23:07:04 +0200, Johannes Schindelin wrote: > On Tue, 8 May 2007, Karl Hasselstr
Hi, On Tue, 8 May 2007, Karl Hasselstr
On 2007-05-09 01:10:13 +0200, Johannes Schindelin wrote: > On Tue, 8 May 2007, Karl Hasselstr
SHA-1 is defined as producing a octet sequence, and to have a canonical=20 hex digit sequence conversion with the high nibbles first. Internally, it= =20 is canonically specified using big-endian math, but the same algorithm=20 could equally be specified with little-endian math and different rules for= =20 It's kind of important to distinguish between the hex representation and=20 the octet representation, because your code will not work at all if you=20 use the wrong one. And "unsigned char *" or "unsigned char[20]" is always= =20 the octets; the hex is always "char *". Primarily mentioning the one that= =20 is more intuitive but less frequently used doesn't help with understanding= =20 the actual code. =09-Daniel *This .sig left intentionally blank*
Hi, > > On Tue, 8 May 2007, Karl Hasselstr
That'll address my complaint nicely, I believe. It was the confusion between these two formats that I was trying to get at. -- Karl Hasselstr
Looks helpful, concise, and to the point. Neat-o. Acked-by: J. Bruce Fields <bfields@citi.umich.edu> Might want to add "in a recent commit"?--it's not clear that you've Unless the reader has already been hanging out on the mailing list a while, "most libified" may not mean much to them yet at this point. The organization of the next bit is slightly confusing: we're set up to expect a longer lecture on the revision walker, but instead there's just the historical note on git-rev-list, a mention of 'revision.c', 'revision.h', and 'struct rev_info', and then it rapidly digresses into discussing builtins. Which actually is fine, but just a few small markers of where we are in the discussion might be reassuring--a section header or two, maybe a little more emphasis on the pointers you're giving, like: "take a moment to go read revision.h and revision.c now, paying special attention to struct rev_info, which ....". --b. -
I had the same impression. I was meaning to write a "code walkthru for git hackers and wannabes" with target audience quite different from the user-manual. My idea of which areas to cover in what order seems to match with what Johannes started. - sha1_name.c; - read_sha1_file(); - revision.c::setup_revisions() to talk about parsing but not about walking yet. - start from builtin-merge-base.c into commit.c to talk about revision traversal done by get_merge_bases(). This codepath is much simpler than the revision.c machinery and is a good primer to understand the latter. - builtin-diff-tree.c to show one tree and two tree cases, go into log-tree.c then tree-diff.c to show the use of add_remove() and change() callbacks, and then finally talk about diff_flush(), without talking about diffcore transformations yet. - start from builtin-log.c to review the setup_revisions(), then talk about prepare_revision_walk() and get_revision() machinery, first pass without talking about path limiting and then with path limiting. - fetch-pack.c and upload-pack.c to talk about the native protocol over ssh and local forking, how revision traversal machinery is used, the "objects pointed by refs are complete" contract. - daemon.c to see how upload-pack is invoked. - read_cache(), active_cache[], active_nr and friends; - update-index and write-tree, including how cache-tree optimizes tree writing after small updates. Advanced students can also look at git-apply here. - unpack-trees.c and builtin-read-tree.c to talk about index stages. - diffcore transformations, especially diffcore-rename. - merge-recursive -
Having said that, I do not think the patch belongs to the "git USER'S manual". It is a very good introductory material for a separate "git hackers manual", though. -
Hi, That is what I was referring to when I mentioned "no outcry". Bruce said that he liked the idea to have something like that in the USER's manual. And I have to agree: There might be enough room to actually go and write a Git hacker's manual, but IMHO that takes a lot of time which has to be found at first. And even if we actually have such a hacker's manual one day, this "sneak preview" in the user's manual does not hurt, but could actually entice people to read that manual, too. Ciao, Dscho -
Well, we could remove the word "user" from the name. There's a pretty good continuum between users and hackers, and that's as it should be. But that would be OK too, as long as we have a clear idea how we're going to decide what goes in which manual. No need to wait until the whole thing's done--we could commit an initial version of it Johannes's work and your outline and fill in the rest later. --b. -
Yeah, I like it too. I forgot to say that, and got right to the criticism instead. But better late than never, I hope. :-) -- Karl Hasselstr
Hi, How about a sentence way before that, when I talk about the initial commit, like this: In the early days, Git (in the tradition of UNIX) was a bunch of programs which were extremely simple, and which you used in scripts, piping the output of one into another. This turned out to be good for initial development, since it was easier to test new things. However, recently many of these parts have become builtins, and some of the core has been "libified", i.e. put into libgit.a for Okay. I hope I will be able to make these changes until tomorrow (I will be gone for a few days after that). Ciao, Dscho -
Hi, for your reviewing pleasure, I made a patch on top of the original one, but I can easily provide a full patch for application. -- [PATCH] user-manual: Touch ups on the birdview section ... as suggested by J. Bruce Fields, Karl Hasselström and Daniel Barkalow. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> --- Documentation/user-manual.txt | 31 +++++++++++++++++++++++++++---- 1 files changed, 27 insertions(+), 4 deletions(-) diff --git a/Documentation/user-manual.txt b/Documentation/user-manual.txt index 2d58bb0..55934db 100644 --- a/Documentation/user-manual.txt +++ b/Documentation/user-manual.txt @@ -3197,7 +3197,15 @@ basically _the_ header file which is included by _all_ of Git's C sources. If you grasp the ideas in that initial commit (it is really small and you can get into it really fast, and it will help you recognize things in the much larger code base we have now), you should go on skimming `cache.h`, -`object.h` and `commit.h`. +`object.h` and `commit.h` in the current version. + +In the early days, Git (in the tradition of UNIX) was a bunch of programs +which were extremely simple, and which you used in scripts, piping the +output of one into another. This turned out to be good for initial +development, since it was easier to test new things. However, recently +many of these parts have become builtins, and some of the core has been +"libified", i.e. put into libgit.a for performance, portability reasons, +and to avoid code duplication. By now, you know what the index is (and find the corresponding data structures in `cache.h`), and that there are just a couple of object types @@ -3236,9 +3244,22 @@ options that were relevant for the different plumbing commands that were called by the script. Most of what `git-rev-list` did is contained in `revision.c` and -`revision.h`. It wraps the options in a struct named rev_info, which +`revision.h`. It wraps the options in a struct named `rev_info`, which ...
I disagree, especially with the past tense of the first half of the paragraph. Git is _still_ a bunch of programs you use in scripts, piping the output of one into another. Another point is that implementation-wise many of the code is currently shared in an internal library, etc. I'd be a bit careful to talk about libgit.a so leisurely since it might give the reader an impression that there really _is_ "the git library", with API and everything, that they can use externally. Of course you need to mention libgit.a, but I'd also mention that it is so far meant To be honest, I wouldn't even be *thinking* about the endianity of SHA-1 octet representation (you don't usually really deal with the hash as with a number, so expecting to have it in native endianity is not very natural; you just deal with it as with a data blob) and the "(big-endian)" would only confuse me and get me thinking about "huh, do they swap the bytes, or wait, they don't, ...?!". But that's maybe just me. -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ Ever try. Ever fail. No matter. // Try again. Fail again. Fail better. -- Samuel Beckett -
Hi, No. Many parts are _not_ simple programs piped into each other. git-log, Frankly, this is just a birdview thing. If you want to go and make a But then, maybe it is just me? I got it completely wrong the first time, fully expecting the calculations to be carried out in host endianness for performance reasons. Ciao, Dscho -
I think the Mozilla implementation carries out calculations in host endianness, and transfers data from the input to the internal state and from the internal state to the final hash with shifts and masks. Which calculations are you seeing that involve byte order? -Daniel *This .sig left intentionally blank* -
Hi, None. I only suspected them to be carried out in byte order. From what I know, there are some shifts involved, which might or might not be helped by 32-bit arithmetic. I did not really look into it. From my prior debugging experiences on Intel, though, I automatically looked for the least significant bytes at the beginning of those "sha1" variables, and came up empty. Ciao, Dscho -
So, I'm confused about what you actually mean by "big endian" here. I originally assumed that you meant that SHA1's are defined as bit arrays, and that the first bit of the SHA1 is in the high-order bit of the first byte. But if you just meant that the first byte of the SHA1 is stored in the first byte of the array... that kind of goes without saying, doesn't it? In any case, maybe this is a detail that's best left to the code itself. --b. -
Hi, Hm. Let me explain it in this way: If you parse a number, passed to a program, with strtol(argv[1], NULL, 0) you would expect something like this on an Intel processor: Input 0x1234 -> memory 0x34 0x12 0x00 0x00. On a big endian machine, you'd expect 0x00 0x00 0x12 0x34. That is what endianness means. If you tell Git that it should look for commit e83c6516..., it will store the sha1 as 0xe8 0x3c 0x65 0x16 ... in memory, no matter which endianness the processor has. Which was positively confusing for me, since I automatically searched for the sequence 0x90 0xf2 0x4a 0x60 ... (which is the tail of that hash). But if all this sounds too confusing, I agree to delete the "(big-endian)". Ciao, Dscho -
Right, but this is something special to integers. If it made sense for some strange reason to define the structure carrying a sha1 as int[5] instead of char[20] then I'd understand the confusion, but char[20] is Yeah, I think that'd be best; thanks. --b. -
But it would be really weird to get 0x90 0xf2 0x4a 0x60 ... 0x16 0x65 0x3c 0xe8 unless you've got a 160-bit little-endian processor. That would be as If it confused you, there should be something there. Maybe "(in order)" or something else implying that the underlying type is an octet sequence, rather than a 160-bit integer? -Daniel *This .sig left intentionally blank* -
Hi, I was not aware originally, that no arithmetic is involved in SHA-1 computation. If you store large integers, it makes tons of sense to follow the endianness, especially if you do _both_ boolean and integer operations on Well, I am convinced by now that nobody could be as stupid as me, so I think it is good without such a hint :-) Ciao, Dscho -
Actually, if you take a look at http://en.wikipedia.org/wiki/Sha1#SHA-1_algorithm you'll see that in addition to an unholy mess of bitwise operations, it does do some additions, on 32-bit big-endian words according to the article. But thinking of them as addition gives you the wrong mental picture; they're simply one of many ways for a standard processor to mix bits efficiently as far as SHA-1 is concerned. The algorithm is specified as yielding a 160-bit binary blob, and can and should be thought of as a black NSA-certified box with "warranty void if this seal is broken" stickers. (Unless you're the one implementing it, of course. But then you know what you're doing.) -- Karl Hasselstr
Those all look like sensible changes to me, thanks! --b. -
