Re: [PATCH] Add a birdview-on-the-source-code section to the user manual

Previous thread: git pull failure, truncated object by Bill Lear on Tuesday, May 8, 2007 - 7:28 am. (6 messages)

Next thread: [PATCH] Add howto files to rpm packages. by Quy Tonthat on Tuesday, May 8, 2007 - 7:19 am. (1 message)
From: Johannes Schindelin
Date: Tuesday, May 8, 2007 - 8:10 am

In http://thread.gmane.org/gmane.comp.version-control.git/42479,
a birdview on the source code was requested.

J. Bruce Fields suggested that my reply should be included in the
user manual, and there was nothing of an outcry, so here it is,
not even 2 months later.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 Documentation/user-manual.txt |  196 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 196 insertions(+), 0 deletions(-)

diff --git a/Documentation/user-manual.txt b/Documentation/user-manual.txt
index 67f5b9b..3c67b68 100644
--- a/Documentation/user-manual.txt
+++ b/Documentation/user-manual.txt
@@ -3161,6 +3161,202 @@ confusing and scary messages, but it won't actually do anything bad. In
 contrast, running "git prune" while somebody is actively changing the 
 repository is a *BAD* idea).
 
+[[birdview-on-the-source-code]]
+A birdview on Git's source code
+-----------------------------
+
+While Git's source code is quite elegant, it is not always easy for 
+new  developers to find their way through it.  A good idea is to look 
+at the contents of the initial commit: 
+_e83c5163316f89bfbde7d9ab23ca2e25604af290_ (also known as _v0.99~954_).
+
+Tip: you can see what files are in there with
+
+----------------------------------------------------
+$ git show e83c5163316f89bfbde7d9ab23ca2e25604af290:
+----------------------------------------------------
+
+and look at those files with something like
+
+-----------------------------------------------------------
+$ git show e83c5163316f89bfbde7d9ab23ca2e25604af290:cache.h
+-----------------------------------------------------------
+
+Be sure to read the README in that revision _after_ you are familiar with 
+the terminology (<<glossary>>), since the terminology has changed a little 
+since then.  For example, we call the things "commits" now, which are 
+described in that README as "changesets".
+
+Actually a lot of the structure as it is now can be explained by that ...
From: Karl
Date: Tuesday, May 8, 2007 - 2:01 pm

Either it should be "unsigned char[40]" (or possibly 41 with a
terminating \0), or else you shouldn't be talking about hexadecimal
since it's just a 20-byte big-endian unsigned integer. (A third
possibility is that I'm totally confused.)

-- 
Karl Hasselstr
From: Johannes Schindelin
Date: Tuesday, May 8, 2007 - 2:07 pm

Hi,

On Tue, 8 May 2007, Karl Hasselstr
From: Karl
Date: Tuesday, May 8, 2007 - 2:31 pm

On 2007-05-08 23:07:04 +0200, Johannes Schindelin wrote:

> On Tue, 8 May 2007, Karl Hasselstr
From: Johannes Schindelin
Date: Tuesday, May 8, 2007 - 4:10 pm

Hi,

On Tue, 8 May 2007, Karl Hasselstr
From: Karl
Date: Tuesday, May 8, 2007 - 4:22 pm

On 2007-05-09 01:10:13 +0200, Johannes Schindelin wrote:

> On Tue, 8 May 2007, Karl Hasselstr
From: Daniel Barkalow
Date: Tuesday, May 8, 2007 - 9:54 pm

SHA-1 is defined as producing a octet sequence, and to have a canonical=20
hex digit sequence conversion with the high nibbles first. Internally, it=
=20
is canonically specified using big-endian math, but the same algorithm=20
could equally be specified with little-endian math and different rules for=
=20

It's kind of important to distinguish between the hex representation and=20
the octet representation, because your code will not work at all if you=20
use the wrong one. And "unsigned char *" or "unsigned char[20]" is always=
=20
the octets; the hex is always "char *". Primarily mentioning the one that=
=20
is more intuitive but less frequently used doesn't help with understanding=
=20
the actual code.

=09-Daniel
*This .sig left intentionally blank*
From: Karl
Date: Tuesday, May 8, 2007 - 11:31 pm

uint8_t, anyone? :-)

-- 
Karl Hasselstr
From: Johannes Schindelin
Date: Wednesday, May 9, 2007 - 2:38 am

Hi,

> > On Tue, 8 May 2007, Karl Hasselstr
From: Karl
Date: Wednesday, May 9, 2007 - 3:43 am

That'll address my complaint nicely, I believe. It was the confusion
between these two formats that I was trying to get at.

-- 
Karl Hasselstr
From: J. Bruce Fields
Date: Tuesday, May 8, 2007 - 8:18 pm

Looks helpful, concise, and to the point.  Neat-o.

Acked-by: J. Bruce Fields <bfields@citi.umich.edu>


Might want to add "in a recent commit"?--it's not clear that you've

Unless the reader has already been hanging out on the mailing list a
while, "most libified" may not mean much to them yet at this point.

The organization of the next bit is slightly confusing: we're set up to
expect a longer lecture on the revision walker, but instead there's just
the historical note on git-rev-list, a mention of 'revision.c',
'revision.h', and 'struct rev_info', and then it rapidly digresses into
discussing builtins.

Which actually is fine, but just a few small markers of where we are in
the discussion might be reassuring--a section header or two, maybe a
little more emphasis on the pointers you're giving, like: "take a moment
to go read revision.h and revision.c now, paying special attention to
struct rev_info, which ....".

--b.
-

From: Junio C Hamano
Date: Tuesday, May 8, 2007 - 9:06 pm

I had the same impression.

I was meaning to write a "code walkthru for git hackers and
wannabes" with target audience quite different from the
user-manual.  My idea of which areas to cover in what order
seems to match with what Johannes started.

 - sha1_name.c;

 - read_sha1_file();

 - revision.c::setup_revisions() to talk about parsing but not
   about walking yet.

 - start from builtin-merge-base.c into commit.c to talk about
   revision traversal done by get_merge_bases().  This codepath
   is much simpler than the revision.c machinery and is a good
   primer to understand the latter.

 - builtin-diff-tree.c to show one tree and two tree cases, go
   into log-tree.c then tree-diff.c to show the use of
   add_remove() and change() callbacks, and then finally talk
   about diff_flush(), without talking about diffcore
   transformations yet.

 - start from builtin-log.c to review the setup_revisions(),
   then talk about prepare_revision_walk() and get_revision()
   machinery, first pass without talking about path limiting and
   then with path limiting.

 - fetch-pack.c and upload-pack.c to talk about the native
   protocol over ssh and local forking, how revision traversal
   machinery is used, the "objects pointed by refs are complete"
   contract.

 - daemon.c to see how upload-pack is invoked.

 - read_cache(), active_cache[], active_nr and friends;

 - update-index and write-tree, including how cache-tree
   optimizes tree writing after small updates.  Advanced students
   can also look at git-apply here.

 - unpack-trees.c and builtin-read-tree.c to talk about index stages.

 - diffcore transformations, especially diffcore-rename.

 - merge-recursive


-

From: Junio C Hamano
Date: Tuesday, May 8, 2007 - 10:05 pm

Having said that, I do not think the patch belongs to the "git
USER'S manual".  It is a very good introductory material for a
separate "git hackers manual", though.



-

From: Johannes Schindelin
Date: Wednesday, May 9, 2007 - 2:33 am

Hi,


That is what I was referring to when I mentioned "no outcry". Bruce said 
that he liked the idea to have something like that in the USER's manual. 

And I have to agree: There might be enough room to actually go and write a 
Git hacker's manual, but IMHO that takes a lot of time which has to be 
found at first.

And even if we actually have such a hacker's manual one day, this "sneak 
preview" in the user's manual does not hurt, but could actually entice 
people to read that manual, too.

Ciao,
Dscho

-

From: J. Bruce Fields
Date: Wednesday, May 9, 2007 - 10:36 am

Well, we could remove the word "user" from the name.  There's a pretty
good continuum between users and hackers, and that's as it should be.

But that would be OK too, as long as we have a clear idea how we're
going to decide what goes in which manual.  No need to wait until the
whole thing's done--we could commit an initial version of it Johannes's
work and your outline and fill in the rest later.

--b.
-

From: Karl
Date: Tuesday, May 8, 2007 - 11:48 pm

Yeah, I like it too. I forgot to say that, and got right to the
criticism instead. But better late than never, I hope. :-)

-- 
Karl Hasselstr
From: Johannes Schindelin
Date: Wednesday, May 9, 2007 - 2:27 am

Hi,



How about a sentence way before that, when I talk about the initial 
commit, like this:

	In the early days, Git (in the tradition of UNIX) was a bunch of 
	programs which were extremely simple, and which you used in scripts, 
	piping the output of one into another. This turned out to be good 
	for initial development, since it was easier to test new things. 
	However, recently many of these parts have become builtins, and 
	some of the core has been "libified", i.e. put into libgit.a for 

Okay. I hope I will be able to make these changes until tomorrow (I will 
be gone for a few days after that).

Ciao,
Dscho

-

From: Johannes Schindelin
Date: Wednesday, May 9, 2007 - 5:19 am

Hi,

for your reviewing pleasure, I made a patch on top of the original one, 
but I can easily provide a full patch for application.

--
[PATCH] user-manual: Touch ups on the birdview section

... as suggested by J. Bruce Fields, Karl Hasselström and Daniel Barkalow.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 Documentation/user-manual.txt |   31 +++++++++++++++++++++++++++----
 1 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/Documentation/user-manual.txt b/Documentation/user-manual.txt
index 2d58bb0..55934db 100644
--- a/Documentation/user-manual.txt
+++ b/Documentation/user-manual.txt
@@ -3197,7 +3197,15 @@ basically _the_ header file which is included by _all_ of Git's C sources.
 If you grasp the ideas in that initial commit (it is really small and you 
 can get into it really fast, and it will help you recognize things in the 
 much larger code base we have now), you should go on skimming `cache.h`, 
-`object.h` and `commit.h`.
+`object.h` and `commit.h` in the current version.
+
+In the early days, Git (in the tradition of UNIX) was a bunch of programs 
+which were extremely simple, and which you used in scripts, piping the 
+output of one into another. This turned out to be good for initial 
+development, since it was easier to test new things.  However, recently 
+many of these parts have become builtins, and some of the core has been 
+"libified", i.e. put into libgit.a for performance, portability reasons, 
+and to avoid code duplication.
 
 By now, you know what the index is (and find the corresponding data 
 structures in `cache.h`), and that there are just a couple of object types 
@@ -3236,9 +3244,22 @@ options that were relevant for the different plumbing commands that were
 called by the script.
 
 Most of what `git-rev-list` did is contained in `revision.c` and 
-`revision.h`.  It wraps the options in a struct named rev_info, which 
+`revision.h`.  It wraps the options in a struct named `rev_info`, which 
 ...
From: Petr Baudis
Date: Wednesday, May 9, 2007 - 5:32 am

I disagree, especially with the past tense of the first half of the
paragraph. Git is _still_ a bunch of programs you use in scripts, piping
the output of one into another. Another point is that
implementation-wise many of the code is currently shared in an internal
library, etc.

I'd be a bit careful to talk about libgit.a so leisurely since it might
give the reader an impression that there really _is_ "the git library",
with API and everything, that they can use externally. Of course you
need to mention libgit.a, but I'd also mention that it is so far meant

To be honest, I wouldn't even be *thinking* about the endianity of SHA-1
octet representation (you don't usually really deal with the hash as
with a number, so expecting to have it in native endianity is not very
natural; you just deal with it as with a data blob) and the
"(big-endian)" would only confuse me and get me thinking about "huh, do
they swap the bytes, or wait, they don't, ...?!".

But that's maybe just me.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Ever try. Ever fail. No matter. // Try again. Fail again. Fail better.
		-- Samuel Beckett
-

From: Johannes Schindelin
Date: Wednesday, May 9, 2007 - 5:50 am

Hi,


No. Many parts are _not_ simple programs piped into each other. git-log, 

Frankly, this is just a birdview thing. If you want to go and make a 

But then, maybe it is just me? I got it completely wrong the first time, 
fully expecting the calculations to be carried out in host endianness for 
performance reasons.

Ciao,
Dscho

-

From: Daniel Barkalow
Date: Wednesday, May 9, 2007 - 9:18 am

I think the Mozilla implementation carries out calculations in host 
endianness, and transfers data from the input to the internal state and 
from the internal state to the final hash with shifts and masks.

Which calculations are you seeing that involve byte order?

	-Daniel
*This .sig left intentionally blank*
-

From: Johannes Schindelin
Date: Wednesday, May 9, 2007 - 9:25 am

Hi,


None. I only suspected them to be carried out in byte order. From what I 
know, there are some shifts involved, which might or might not be helped 
by 32-bit arithmetic.

I did not really look into it.

From my prior debugging experiences on Intel, though, I automatically 
looked for the least significant bytes at the beginning of those "sha1" 
variables, and came up empty.

Ciao,
Dscho


-

From: J. Bruce Fields
Date: Wednesday, May 9, 2007 - 10:07 am

So, I'm confused about what you actually mean by "big endian" here.  I
originally assumed that you meant that SHA1's are defined as bit arrays,
and that the first bit of the SHA1 is in the high-order bit of the first
byte.  But if you just meant that the first byte of the SHA1 is stored
in the first byte of the array...  that kind of goes without saying,
doesn't it?

In any case, maybe this is a detail that's best left to the code itself.

--b.
-

From: Johannes Schindelin
Date: Wednesday, May 9, 2007 - 1:15 pm

Hi,


Hm.

Let me explain it in this way:

If you parse a number, passed to a program, with strtol(argv[1], NULL, 0) 
you would expect something like this on an Intel processor:

Input 0x1234 -> memory 0x34 0x12 0x00 0x00.

On a big endian machine, you'd expect 0x00 0x00 0x12 0x34.

That is what endianness means.

If you tell Git that it should look for commit e83c6516..., it will store 
the sha1 as 0xe8 0x3c 0x65 0x16 ... in memory, no matter which 
endianness the processor has.

Which was positively confusing for me, since I automatically searched for 
the sequence 0x90 0xf2 0x4a 0x60 ... (which is the tail of that hash).

But if all this sounds too confusing, I agree to delete the 
"(big-endian)".

Ciao,
Dscho


-

From: J. Bruce Fields
Date: Wednesday, May 9, 2007 - 1:32 pm

Right, but this is something special to integers.  If it made sense for
some strange reason to define the structure carrying a sha1 as int[5]
instead of char[20] then I'd understand the confusion, but char[20] is

Yeah, I think that'd be best; thanks.

--b.
-

From: Daniel Barkalow
Date: Wednesday, May 9, 2007 - 1:45 pm

But it would be really weird to get 0x90 0xf2 0x4a 0x60 ... 0x16 0x65 0x3c 
0xe8 unless you've got a 160-bit little-endian processor. That would be as 

If it confused you, there should be something there. Maybe "(in order)" or 
something else implying that the underlying type is an octet sequence, 
rather than a 160-bit integer?

	-Daniel
*This .sig left intentionally blank*
-

From: Johannes Schindelin
Date: Wednesday, May 9, 2007 - 3:23 pm

Hi,


I was not aware originally, that no arithmetic is involved in SHA-1 
computation.

If you store large integers, it makes tons of sense to follow the 
endianness, especially if you do _both_ boolean and integer operations on 

Well, I am convinced by now that nobody could be as stupid as me, so I 
think it is good without such a hint :-)

Ciao,
Dscho

-

From: Karl
Date: Thursday, May 10, 2007 - 1:01 pm

Actually, if you take a look at

  http://en.wikipedia.org/wiki/Sha1#SHA-1_algorithm

you'll see that in addition to an unholy mess of bitwise operations,
it does do some additions, on 32-bit big-endian words according to the
article.

But thinking of them as addition gives you the wrong mental picture;
they're simply one of many ways for a standard processor to mix bits
efficiently as far as SHA-1 is concerned. The algorithm is specified
as yielding a 160-bit binary blob, and can and should be thought of as
a black NSA-certified box with "warranty void if this seal is broken"
stickers. (Unless you're the one implementing it, of course. But then
you know what you're doing.)

-- 
Karl Hasselstr
From: J. Bruce Fields
Date: Wednesday, May 9, 2007 - 6:18 am

Those all look like sensible changes to me, thanks!

--b.
-

Previous thread: git pull failure, truncated object by Bill Lear on Tuesday, May 8, 2007 - 7:28 am. (6 messages)

Next thread: [PATCH] Add howto files to rpm packages. by Quy Tonthat on Tuesday, May 8, 2007 - 7:19 am. (1 message)