Published on KernelTrap (http://kerneltrap.org)

Linux: Managing the Kernel Source With 'git'

By Jeremy
Created Apr 11 2005 - 08:57

Linus Torvalds began working on an interim solution called "git" in the absence of BitKeeper [story [1]]. A README [2] included with the source describes it as, "a stupid (but extremely fast) directory content manager. It doesn't do a whole lot, but what it _does_ do is track directory contents efficiently." The documentation goes on to describe two abstractions used by the tool, an "object database", and a "current directory cache". Objects in the object database are referred to by the SHA1 hash of their zlib compressed contents. The various supported object types include, "blobs" which are simply binary blobs of data with no added verification, "trees" which are lists of objects sorted by name, and "changesets" which provide a historical view of an object describing "how we got there, and why". The current directory cache is a binary file "which contains an efficient representation of a virtual directory content at some random time."

During the discussion regarding git and its rapid evolution, Linus explained, "in many ways you can just see git as a filesystem - it's content-addressable, and it has a notion of versioning, but I really really designed it coming at the problem from the viewpoint of a _filesystem_ person (hey, kernels is what I do), and I actually have absolutely _zero_ interest in creating a traditional SCM system." As for actual usage, Linus noted, "I think we can make the workflow look like bk, ie pretty much like "git pull" and "git push". And for well-behaved stuff (ie minimal changes to the same files on both sides) it will even be fast. I think." Read on for much of the resulting discussion [3] which provides a fuller understanding of how the evolving tool will work.
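
To make the "object database" idea concrete, here is a minimal sketch of a content-addressable store in which each object is named by the SHA1 of its zlib-compressed contents, as the README describes. It is an illustration only, not git's code: the function names `write_object` and `read_object` and the flat one-directory layout are assumptions, and object type tags are left out.

```python
import hashlib
import os
import tempfile
import zlib


def write_object(db_dir, content):
    """Store content; return its name (hex SHA1 of the compressed bytes)."""
    compressed = zlib.compress(content)
    name = hashlib.sha1(compressed).hexdigest()
    path = os.path.join(db_dir, name)
    if not os.path.exists(path):  # objects are immutable, so never rewrite one
        with open(path, "wb") as f:
            f.write(compressed)
    return name


def read_object(db_dir, name):
    """Load an object by name, checking that the stored bytes still match it."""
    with open(os.path.join(db_dir, name), "rb") as f:
        compressed = f.read()
    if hashlib.sha1(compressed).hexdigest() != name:
        raise ValueError("corrupt object " + name)
    return zlib.decompress(compressed)


# Small demonstration in a throwaway directory (hypothetical data).
db = tempfile.mkdtemp()
name = write_object(db, b"hello\n")
print(name, read_object(db, name))
```

Because the name is derived from the data, storing the same content twice yields the same file, and a damaged file is detected when it is read back.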


From: Linus Torvalds [email blocked]
To: Paul Mackerras [email blocked]
Subject: Re: Kernel SCM saga..
Date: 	Thu, 7 Apr 2005 08:10:21 -0700 (PDT)



On Thu, 7 Apr 2005, Paul Mackerras wrote:
> 
> Are you happy with processing patches + descriptions, one per mail?

Yes. That's going to be my interim, I was just hoping that with 2.6.12-rc2 
out the door, and us in a "calming down" period, I could afford to not 
even do that for a while.

The real problem with the email thing is that it ends up piling up: what 
BK did in this respect was that anything that piled up in a BK repository 
ended up still being there, and a single "bk pull" got it anyway - so if 
somebody got ignored because I was busy with something else, it didn't add 
any overhead. The queue didn't get "congested".

And that's a big thing. It comes from the "Linus pulls" model where people 
just told me that they were ready, instead of the "everybody pushes to 
Linus" model, where the destination gets congested at times.

So I do not want the "send Linus email patches" (whether mboxes or a 
single patch per email) to be a very long-term strategy. We can handle it 
for a while (in particular, I'm counting on it working up to the real 
release of 2.6.12, since we _should_ be in the calm period for the next 
month anyway), but it doesn't work in the long run.

> Do you have it automated to the point where processing emailed patches
> involves little more overhead than doing a bk pull?

It's more overhead, but not a lot. Especially nice numbered sequences like
Andrew sends (where I don't have to manually try to get the dependencies
right by trying to figure them out and hope I'm right, but instead just
sort by Subject: line) is not a lot of overhead. I can process a hundred
emails almost as easily as one, as long as I trust the maintainer (which,
when it's used as a BK replacement, I obviously do).

However, the SCM's I've looked at make this hard. One of the things (the
main thing, in fact) I've been working at is to make that process really
_efficient_. If it takes half a minute to apply a patch and remember the
changeset boundary etc (and quite frankly, that's _fast_ for most SCM's
around for a project the size of Linux), then a series of 250 emails
(which is not unheard of at all when I sync with Andrew, for example)  
takes two hours. If one of the patches in the middle doesn't apply, things
are bad bad bad.

Now, BK wasn't a speed demon either (actually, compared to everything
else, BK _is_ a speed demon, often by one or two orders of magnitude),
and took about 10-15 seconds per email when I merged with Andrew. HOWEVER,
with BK that wasn't as big of an issue, since the BK<->BK merges were so
easy, so I never had the slow email merges with any of the other main
developers. So a patch-application-based SCM "merger" actually would need
to be _faster_ than BK is. Which is really really really hard.

So I'm writing some scripts to try to track things a whole lot faster.  
Initial indications are that I should be able to do it almost as quickly
as I can just apply the patch, but quite frankly, I'm at most half done,
and if I hit a snag maybe that's not true at all. Anyway, the reason I can
do it quickly is that my scripts will _not_ be an SCM, they'll be a very
specific "log Linus' state" kind of thing. That will make the linear patch
merge a lot more time-efficient, and thus possible.

(If a patch apply takes three seconds, even a big series of patches is not
a problem: if I get notified within a minute or two that it failed
half-way, that's fine, I can then just fix it up manually. That's why 
latency is critical - if I'd have to do things effectively "offline", 
I'd by definition not be able to fix it up when problems happen).
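
As a rough check of the arithmetic above (an illustration, not part of the original mail), the per-patch costs mentioned translate into the following totals for a 250-patch series. The helper name `series_minutes` is made up for this sketch; the per-patch figures are the ones quoted in the mail.

```python
def series_minutes(patches, seconds_per_patch):
    """Total wall-clock minutes to apply a linear series of patches."""
    return patches * seconds_per_patch / 60


# Per-patch costs quoted in the mail: 30 s (typical SCM), 10-15 s (BK), 3 s (goal).
for label, secs in [("typical SCM", 30), ("BK, low", 10), ("BK, high", 15), ("goal", 3)]:
    print(f"{label:>12}: {series_minutes(250, secs):6.1f} min")
```

At 30 seconds per patch the series takes a little over two hours, while at three seconds it finishes in under a quarter of an hour, which is the difference between working "offline" and being able to fix a failed patch on the spot.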

> If so, then your mailbox (or patch queue) becomes a natural
> serialization point for the changes, and the need for a tool that can
> handle a complex graph of changes is much reduced.

Yes. In the short term. See above why I think the congestion issue will 
really mean that we want to have parallel merging in the not _too_ 
distant future.

NOTE! I detest the centralized SCM model, but if push comes to shove, and
we just _can't_ get a reasonable parallel merge thing going in the short
timeframe (ie month or two), I'll use something like SVN on a trusted site
with just a few committers, and at least try to distribute the merging out
over a few people rather than making _me_ be the throttle.

The reason I don't really want to do that is once we start doing it that
way, I suspect we'll have a _really_ hard time stopping. I think it's a
broken model. So I'd much rather try to have some pain in the short run 
and get a better model running, but I just wanted to let people know that 
I'm pragmatic enough that I realize that we may not have much choice.

> * Visibility into what you had accepted and committed to your
>   repository
> * Lower latency of patches going into your repository
> * Much reduced rate of patches being dropped

Yes. 

		Linus


From: Linus Torvalds [email blocked]
Subject: Re: Kernel SCM saga..
Date: Thu, 7 Apr 2005 21:42:04 -0700 (PDT)

On Thu, 7 Apr 2005, Chris Wedgwood wrote:
> 
> I'm playing with monotone right now. Superficially it looks like it
> has tons of gee-whiz neato stuff... however, it's *agonizingly* slow.
> I mean glacial. A heavily sedated sloth with no legs is probably
> faster.

Yes. The silly thing is, at least in my local tests it doesn't actually
seem to be _doing_ anything while it's slow (there are no system calls
except for a few memory allocations and de-allocations). It seems to have
some exponential function on the number of pathnames involved etc.

I'm hoping they can fix it, though. The basic notions do not sound wrong.

In the meantime (and because monotone really _is_ that slow), here's a
quick challenge for you, and any crazy hacker out there: if you want to
play with something _really_ nasty (but also very _very_ fast), take a
look at kernel.org:/pub/linux/kernel/people/torvalds/.

First one to send me the changelog tree of sparse-git (and a tool to
commit and push/pull further changes) gets a gold star, and an honorable
mention. I've put a hell of a lot of clues in there (*).

I've worked on it (and little else) for the last two days. Time for
somebody else to tell me I'm crazy.

		Linus

(*) It should be easier than it sounds. The database is designed so that
you can do the equivalent of a nonmerging (ie pure superset) push/pull
with just plain rsync, so replication really should be that easy (if
somewhat bandwidth-intensive due to the whole-file format).

Never mind merging. It's not an SCM, it's a distribution and archival
mechanism. I bet you could make a reasonable SCM on top of it, though.
Another way of looking at it is to say that it's really a
content-addressable filesystem, used to track directory trees.


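
The footnote's point that a non-merging push/pull is a pure superset copy can be sketched as follows. This is an illustration of the idea rather than what rsync or git does internally: the function name `pull_objects`, the flat directory of object files, and the sample object names are all assumptions.

```python
import os
import shutil
import tempfile


def pull_objects(src_db, dst_db):
    """Copy every object that dst_db lacks; existing objects are never touched."""
    copied = []
    os.makedirs(dst_db, exist_ok=True)
    for name in sorted(os.listdir(src_db)):
        src = os.path.join(src_db, name)
        dst = os.path.join(dst_db, name)
        # Objects are immutable and named by their content, so a name that is
        # already present can be skipped without comparing the data.
        if os.path.isfile(src) and not os.path.exists(dst):
            shutil.copyfile(src, dst)
            copied.append(name)
    return copied


# Demonstration with made-up object names in throwaway directories.
src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
for obj_name, data in [("aa11", b"one"), ("bb22", b"two")]:
    with open(os.path.join(src, obj_name), "wb") as f:
        f.write(data)
first = pull_objects(src, dst)
second = pull_objects(src, dst)
print(first, second)
```

The second call copies nothing, which is why repeated replication is cheap once the two sides mostly agree.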
From: Linus Torvalds [email blocked]
Subject: Re: Kernel SCM saga..
Date: Fri, 8 Apr 2005 11:47:10 -0700 (PDT)

On Fri, 8 Apr 2005, Jeff Garzik wrote:
> 
> Well... it took me over 30 seconds just to 'rm -rf' the unpacked
> tarballs of git and sparse-git, over my LAN's NFS.

Don't use NFS for development. It sucks for BK too.

That said, normal _use_ should actually be pretty efficient even over NFS.
It will "stat" a hell of a lot of files to do the "show-diff", but that
part you really can't avoid unless you depend on all the tools marking
their changes somewhere. Which BK does, actually, but that was pretty
painful, and means that bk needed to re-implement all the normal ops (ie
"bk patch").

What's also nice is that exactly because "git" depends on totally
immutable files, they actually cache very well over NFS. Even if you were
to share a database across machines (which is _not_ what git is meant to
do, but it's certainly possible).

So I actually suspect that if you actually _work_ with a tree in "git",
you will find performance very good indeed. The fact that it creates a
number of files when you pull in a new repository is a different thing.

> Granted that this sort of stuff works well with (a) rsync and (b)
> hardlinks, but it's still punishment on the i/dcache.

Actually, it's not. Not once it is set up. Exactly because "git" doesn't
actually access those files unless it literally needs the data in one
file, and then it's always set up so that it needs either none or _all_ of
the file. There is no data sharing anywhere, so you are never in the
situation where it needs "ten bytes from file X" and "25 bytes from file
Y".

For example, if you don't have any changes in your tree, there is exactly
_one_ file that a "show-diff" will read: the .dircache/index file. That's
it. After that, it will "stat()" exactly the files you are tracking, and
nothing more. It will not touch any internal "git" data AT ALL. That
"stat" will be somewhat expensive unless your client caches stat data too,
but that's it.

And if it turns out that you have changed a file (or even just touched it,
so that the data is the same, but the index file can no longer guarantee
it with just a single "stat()"), then git will open exactly _one_ file (no
searching, no messing around), which contains absolutely nothing except
for the compressed (and SHA1-signed) old contents of the file. It
obviously _has_ to do that, because in order to know whether you've
changed it, it needs to now compare it to the original.

IOW, "git" will literally touch the minimum IO necessary, and absolutely
minimum cache-footprint.

The fact is, when tracking the 17,000 files in the kernel directory, most
of them are never actually changed. They literally are "free". They aren't
brought into the cache by "git" - not the file itself, not the backing
store. You set up the index file once, and you never ever touch them
again. You could put the sha1 files on a tape, for all git cares.

The one exception obviously being when you actually instantiate the git
archive for the first time (or when you throw it away). At that time you
do touch all of the data, but that should be the only time.

THAT is what git is good at. It's optimized for the "not a lot of changes"
things, and pretty much all the operations are O(n) in the "size of
change", not in "size of repo".

That includes even things like "give me the diff between the top of tree
and the tree 10 days ago". If you know what your head was 10 days ago,
"git" will open exactly _four_ small files for this operation (the current
"top" commit, the commit file of ten days ago, and the two "tree" files
associated with those). It will then need to open the backing store files
for the files that are different between the two versions, but IT WILL
NEVER EVEN LOOK at the files that it immediately sees are the same.

And that's actually true whether we're talking about the top-of-tree or
not. If I had the kernel history in git format (I don't - I estimate that
it would be about 1.5GB - 2GB in size, and would take me about ten days to
extract from BK ;), I could do a diff between _any_ tagged version (and I
mention "tagged" only as a way to look up the commit ID - it doesn't have
to be tagged if you know it some other way) in O(n) where 'n' is the
number of files that have changed between the revisions.

Number of changesets doesn't matter. Number of files doesn't matter. The
_only_ thing that matters is the size of the change.

Btw, I don't actually have a git command to do this yet. A bit of
scripting required to do it, but it's pretty trivial: you open the two
"commit" files that are the beginning/end of the thing, you look up what
the tree state was at each point, you open up the two tree files involved,
and you ignore all entries that match.

Since the tree files are already sorted, that "ignoring matches" is
basically free (technically that's O(n) in the number of files described,
but we're talking about something that even a slow machine can do so fast
you probably can't even time it with a stop-watch). You now have the
complete list of files that have been changed (removed, added or "exists
in both trees, but different contents"), and you can thus trivially create
the whole tree with opening up _only_ the indexes for those files.

Ergo: O(n) in size of change. Both in work and in disk/cache access (where
the latter is often the more important one). Absolutely _zero_ indirection
anywhere apart from the initial stage to go from "commit" to "tree", so
there's no seeking except to actually read the files once you know what
they are (and since you know them up-front and there are no dependencies
at that point, you could even tell the OS to prefetch them if you really
cared about getting minimal disk seeks).

		Linus


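
The "open the two tree files and ignore all entries that match" step described above amounts to a single merge-style pass over two name-sorted lists. The following is a sketch under assumptions, not the eventual tool: the tuple layout `(name, mode, sha1)`, the function name `diff_trees`, and the short placeholder IDs are invented for illustration.

```python
def diff_trees(old, new):
    """Compare two tree listings, each a list of (name, mode, sha1) sorted by name.

    Returns (kind, name) pairs where kind is "removed", "added" or "changed".
    Matching entries are skipped without looking at any file contents.
    """
    out = []
    i = j = 0
    while i < len(old) and j < len(new):
        oname, omode, osha = old[i]
        nname, nmode, nsha = new[j]
        if oname == nname:
            if (omode, osha) != (nmode, nsha):
                out.append(("changed", oname))
            i += 1
            j += 1
        elif oname < nname:
            out.append(("removed", oname))
            i += 1
        else:
            out.append(("added", nname))
            j += 1
    out.extend(("removed", name) for name, _, _ in old[i:])
    out.extend(("added", name) for name, _, _ in new[j:])
    return out


# Placeholder object IDs ("a1", "b2", ...) stand in for real 40-digit SHA1s.
old_tree = [("Makefile", "100664", "a1"), ("cache.h", "100664", "b2"),
            ("init-db.c", "100664", "c3")]
new_tree = [("Makefile", "100664", "a9"), ("cache.h", "100664", "b2"),
            ("fsck-cache.c", "100664", "d4")]
result = diff_trees(old_tree, new_tree)
print(result)
```

Only the entries that differ appear in the result, so any later work (opening blobs, producing a textual diff) is proportional to the size of the change.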
From: Linus Torvalds [email blocked]
Subject: more git updates..
Date: Sat, 9 Apr 2005 12:45:52 -0700 (PDT)

Sorry guys, several of you have sent me small fixes and scripts to "git",
but I've been busy on breaking/changing the core infrastructure, so I
didn't get around to looking at the scripts yet.

The good news is, the data structures/indexes haven't changed, but many of
the tools to interface with them have new (and improved!) semantics:

In particular, I changed how "read-tree" works, so that it now mirrors
"write-tree", in that instead of actually changing the working directory,
it only updates the index file (aka "current directory cache" file from
the tree).

To actually change the working directory, you'd first get the index file
setup, and then you do a "checkout-cache -a" to update the files in your
working directory with the files from the sha1 database.

Also, I wrote the "diff-tree" thing I talked about:

	torvalds@ppc970:~/git> ./diff-tree \
		8fd07d4b7778cd0233ea0a17acd3fe9d710af035 8c6d29d6a496d12f1c224db945c0c56fd60ce941 \
		| tr '\0' '\n'
	<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile
	>100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile
	<100664 9e1bee21e17c134a2fb008db62679048fc819528 cache.h
	>100664 56ef561e590fd99e938bd47fd1f2c7ed46126ff0 cache.h
	<100664 fd690acc02ef9c06d7c4c3541f69b10ca4b4f8c9 cat-file.c
	>100664 6e6d89291ced17a406e64b97fe8bb96a22eefc9d cat-file.c
	+100664 fd00e5603dcc4a93acceda0b8cb914fabc8645d5 checkout-cache.c
	<100664 a4a8c3d9ef0c4cc6c82b96b5d1a91ac6d3bed466 commit-tree.c
	>100664 236ceb7646e3f5d110fd83f815b82e94cc5b2927 commit-tree.c
	+100664 01c92f2620a8e13e7cb7fd98ee644c6b65eeccb7 fsck-cache.c
	<100664 0eaa053919e0cc400ab9bc40d9272360117e6978 init-db.c
	>100664 815743e92dad7e451c65bab01448ee8ae9deeb56 init-db.c
	<100664 e7bfaadd5d2331123663a8f14a26604a3cdcb678 read-cache.c
	>100664 71d0cb6fe9b7ff79e3b2c5a61e288ac9f62b39dc read-cache.c
	<100664 ec0f167a6a505659e5af6911c97f465506534c34 read-tree.c
	>100664 f5c50ba79d02f002b9675fd8f129fa388e3282c6 read-tree.c
	<100664 00a29c403e751c2a2a61eb24fa2249c8956d1c80 show-diff.c
	>100664 b963dd738989bc92bf02352bbedad13a74e66a7d show-diff.c
	<100664 aff074c63ac827801a7d02ff92781365957f1430 update-cache.c
	>100664 3a672397164d5ff27a19a6888b578af96824ede7 update-cache.c
	<100664 7abeeba116b2b251c12ae32c7b38cb048199b574 write-tree.c
	>100664 9525c6fc975888a394477339db86216cd5bd5d7c write-tree.c

(ie the output of "diff-tree" has the same NUL-termination, but if you
insist on getting ASCII output, you can just use "tr" to change the NUL
into a NL).

The format of the "diff-tree" output is that the first character is "-"
for "remove file", "+" for "add file" and "<"/">" for "change file" (where
the "<" shows the old state, and ">" shows the new state).

Btw, the NUL-termination makes this really easy to use even in shell
scripts, ie you can do

	diff-tree <sha1> <sha1> | xargs -0 do_something

and you'll get each line as one nice argument to your "do_something"
script. So a do_diff could be based on something like

	#!/bin/sh
	while [ "$1" != "" ]; do
		filename="$(echo $1 | cut -d' ' -f3-)"
		first_sha="$(echo $1 | cut -d' ' -f2)"
		second_sha="$(echo $2 | cut -d' ' -f2)"
		c="$(echo $1 | cut -c1)"
		case "$c" in
		"+")
			echo diff -u /dev/null "$filename($first_sha)";;
		"-")
			echo diff -u "$filename($first_sha)" /dev/null;;
		"<")
			echo diff -u "$filename($first_sha)" "$filename($second_sha)"
			shift;;
		*)
			echo WHAT?
			exit 1;;
		esac
		shift
	done

which really shows what a horrid shell-person I am (I still use the old
tools I learnt to use fifteen years ago. I bet you can do it trivially in
perl or something sane, and I'm just stuck in the stone age of UNIX).

That makes it _very_ easy to parse. The example above is the diff between
the initial commit and one of the more recent trees, so it has changes to
everything, but a more normal thing would be

	torvalds@ppc970:~/git> diff-tree \
		787763499dc4f8cc345bc6ed8ee1e0ae31adedd6 5b0c2695634b5bab2f5d63c7bb30f7e5815af470 \
		| tr '\0' '\n'
	<100664 01c92f2620a8e13e7cb7fd98ee644c6b65eeccb7 fsck-cache.c
	>100664 81aa7bee003264ea302db835158e725eefa4012d fsck-cache.c

which tells you that the last commit changed just one file (it's from this
one:

	torvalds@ppc970:~/git> cat-file commit `cat .dircache/HEAD`
	tree 5b0c2695634b5bab2f5d63c7bb30f7e5815af470
	parent 81c53a1d3551f358860731481bb2d87179d221e6
	author Linus Torvalds [email blocked] Sat Apr 9 12:02:30 2005
	committer Linus Torvalds [email blocked] Sat Apr 9 12:02:30 2005

	Make "fsck-cache" print out all the root commits it finds.

	Once I do the reference tracking, I'll also make it print out all the
	HEAD commits it finds, which is even more interesting.

in case you care).

I've rsync'ed the new git repository to kernel.org, it should all be there
in /pub/linux/kernel/people/torvalds/git.git/ (and it looks like the
mirror scripts already picked it up on the public side too).

Can you guys re-send the scripts you wrote? They probably need some
updating for the new semantics. Sorry about that ;(

		Linus


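
For readers who would rather consume the NUL-terminated "diff-tree" output outside the shell, here is a hedged sketch of a parser. It assumes each record looks like the ones quoted in the mail (a one-character marker, then the mode, a space, the SHA1, a space, and the file name). The function name `parse_diff_tree` and the returned tuple layout are choices made for this example, and the sample input is assembled from records quoted in the thread.

```python
def parse_diff_tree(raw):
    """Turn NUL-terminated diff-tree records into (kind, name, old, new) tuples.

    old and new are (mode, sha1) pairs, or None when that side does not exist.
    """
    records = [r.decode() for r in raw.split(b"\0") if r]
    changes = []
    it = iter(records)
    for rec in it:
        kind = rec[0]
        mode, sha, name = rec[1:].split(" ", 2)
        if kind == "+":
            changes.append(("add", name, None, (mode, sha)))
        elif kind == "-":
            changes.append(("remove", name, (mode, sha), None))
        elif kind == "<":
            # A "<" record is always followed by the matching ">" record.
            nxt = next(it, None)
            if nxt is None or nxt[0] != ">":
                raise ValueError("'<' record without matching '>' record")
            nmode, nsha, _ = nxt[1:].split(" ", 2)
            changes.append(("change", name, (mode, sha), (nmode, nsha)))
        else:
            raise ValueError("unexpected record: " + rec)
    return changes


sample = (
    b"+100664 fd00e5603dcc4a93acceda0b8cb914fabc8645d5 checkout-cache.c\0"
    b"<100664 01c92f2620a8e13e7cb7fd98ee644c6b65eeccb7 fsck-cache.c\0"
    b">100664 81aa7bee003264ea302db835158e725eefa4012d fsck-cache.c\0"
)
changes = parse_diff_tree(sample)
for change in changes:
    print(change)
```

Splitting on NUL rather than newline keeps file names with spaces or newlines intact, which is the reason the mail gives for the NUL-terminated format.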
From: Linus Torvalds [email blocked]
Subject: Re: more git updates..
Date: Sat, 9 Apr 2005 12:56:16 -0700 (PDT)

On Sat, 9 Apr 2005, Linus Torvalds wrote:
> 
> To actually change the working directory, you'd first get the index file
> setup, and then you do a "checkout-cache -a" to update the files in your
> working directory with the files from the sha1 database.

Btw, this will not overwrite any old files, so if you have an old version
of something, you'd need to do "checkout-cache -f -a" (and order matters:
the "-f" must come first).

This time I actually have a big comment at the top of the checkout-cache.c
file trying to explain the logic.

		Linus


From: Petr Baudis [email blocked]
Subject: Re: more git updates..
Date: Sat, 9 Apr 2005 22:07:09 +0200

  Hello,

Dear diary, on Sat, Apr 09, 2005 at 09:45:52PM CEST, I got a letter
where Linus Torvalds [email blocked] told me that...
> The good news is, the data structures/indexes haven't changed, but many of
> the tools to interface with them have new (and improved!) semantics:
> 
> In particular, I changed how "read-tree" works, so that it now mirrors
> "write-tree", in that instead of actually changing the working directory,
> it only updates the index file (aka "current directory cache" file from
> the tree).
> 
> To actually change the working directory, you'd first get the index file
> setup, and then you do a "checkout-cache -a" to update the files in your
> working directory with the files from the sha1 database.

  that's great. I was planning to do something with this since currently
it really annoyed me. I think I will like this, even though I didn't look
at the code itself yet (just on my way).

> Also, I wrote the "diff-tree" thing I talked about:
..snip..

  Hmm, I wonder, is this better done in C instead of a simple shell
script, like my gitdiff.sh? I'd say it is more flexible and probably
hardly performance-critical to have this scripted, and not difficult at
all provided you have ls-tree. But maybe I'm just too fond of my script...
;-) (Ok, there's some trouble when you want to have newlines and spaces in
file names, and join appears to be awfully ignorant about this... :[ )

  BTW, do we care about changed modes? If so, they should probably have
their place in the diff-tree output.

  BTW#2, I hope you will merge my ls-tree anyway, even though there is no
user for it currently... I should quickly figure out some. :-)

> Can you guys re-send the scripts you wrote? They probably need some
> updating for the new semantics. Sorry about that ;(

  I'll try to merge ASAP.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/ [4]
98% of the time I am right. Why worry about the other 3%.


From: Linus Torvalds [email blocked]
Subject: Re: more git updates..
Date: Sat, 9 Apr 2005 14:00:09 -0700 (PDT)

On Sat, 9 Apr 2005, Petr Baudis wrote:
> 
> > Also, I wrote the "diff-tree" thing I talked about:
> ..snip..
> 
> Hmm, I wonder, is this better done in C instead of a simple shell
> script, like my gitdiff.sh?

With 17,000 files in the kernel, and most commits just changing a small
number of them, I actually think "diff-tree" matters.

You use "join" (which is quite reasonable), but let's put it this way:
just the list of files in the current kernel is about half a megabyte of
data. Ie your temporary files that you use in the "ls-tree + ls-tree +
join" is actually going to be quite sizeable.

My goal here is that the speed of "git" really should be almost totally
independent of the size of the project. You clearly cannot avoid _some_
size-dependency: my "diff-tree" clearly also has to work through the same
1MB of data, but I think it's worth making the constant factor be as small
as humanly possible.

I just tried checking in a kernel tree tar-file, and the initial checkin
(which is all the compression and the sha1 calculations for every single
file) took about 1:35 (minutes, not hours ;).

Doing a commit (trivial change to the top-level Makefile) and then doing a
"treediff" between those two things took 0.05 seconds using my C thing. Ie
we're talking so fast that we really don't care.

Doing a "show-diff" takes 0.15 secs or so (that's all the "stat" calls),
and now that I test it out I realize that the most expensive operation is
actually _writing_ the "index" file out. These are the two most expensive
steps:

	torvalds@ppc970:~/lx-test/linux-2.6.12-rc2> time update-cache Makefile

	real	0m0.283s
	user	0m0.171s
	sys	0m0.113s

	torvalds@ppc970:~/lx-test/linux-2.6.12-rc2> time write-tree
	5ca21c9d808fa4bee1eb6948a59dfb9c7d73f36a

	real	0m0.441s
	user	0m0.354s
	sys	0m0.087s

ie with the current infrastructure it looks like I can do a "patch +
commit" in less than one second on the kernel, and 0.75 secs of that is
because the "tree" file actually grows pretty large:

	cat-file tree 5ca21c9d808fa4bee1eb6948a59dfb9c7d73f36a | wc -c

says that the uncompressed tree-file is 950,874 bytes. Compressing it
means that the archival version of it is "just" 462,546 bytes, but this is
really the part that is going to eat _tons_ of disk-space.

In other words, each "commit" file is very small and cheap, but since
almost every commit will also imply a totally new tree-file, "git" is
going to have an overhead of half a megabyte per commit. Oops.

Damn, that's painful. I suspect I will have to change the format somehow.

One option (which I haven't tested yet) is that since the tree-file is
already sorted, I could always write it out with the common subdirectory
part "collapsed", ie instead of writing

	...
	include/asm-i386/mach-default/bios_ebda.h
	include/asm-i386/mach-default/do_timer.h
	...

I'd write just

	...
	///bios_ebda.h
	///do_timer.h
	...

since the directory names are implied by the predecessor.

However, that doesn't help with the 20-byte sha1 associated with each
file, which is also obviously uncompressible, so with 17,000+ files, we
have a minimum overhead of about 350kB per tree-file. So even if I did the
pathname compression, it wouldn't help all that much. I'd only be removing
the only part of the file that _is_ very compressible, and I'd probably
end up with something that isn't all that far away from the 450kB+ it is
now.

I suspect that I have to change the file format. Maybe make the "tree"
object a two-level thing, and have a "directory" object.

Then a "tree" object would point to a "directory" object, which would in
turn point to the individual files (and other "directory" objects, of
course). That way a commit that only changes a few files will only need to
create a few new "directory" objects, instead of creating one huge "tree"
object.

Sadly, that will make "tree-diff" potentially more expensive. On the other
hand, maybe not: it will also speed it _up_, since directories that are
totally shared will be trivially seen as such and need no further
operation.

Thoughts? That would break the current repository formats, and I'd have to
create a converter thing (which shouldn't be that bad, of course).

I don't have to do it right now. In fact, I'd almost prefer for the
current thing to become good enough that it's not painful to work with,
since right now I'm using it to develop itself. Then I can convert the
format with an automated script later, before I actually start working on
the kernel...

> BTW, do we care about changed modes? If so, they should probably have
> their place in the diff-tree output.

They're there. If you want to ignore them, you can just notice that the
sha1 matches between two lines, and then you don't even have to diff them.

		Linus


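
The "collapsed" pathname idea floated in this mail (and then set aside) can be sketched as below. This is an interpretation, not code from git: it assumes that each directory component shared with the previous entry is written as an empty string, which reproduces the "///bios_ebda.h" shape from the example. The names `collapse` and `expand` are hypothetical.

```python
def collapse(paths):
    """Replace leading directory components shared with the previous path by ''."""
    out = []
    prev_dirs = []
    for path in paths:
        *dirs, name = path.split("/")
        shared = 0
        while (shared < len(dirs) and shared < len(prev_dirs)
               and dirs[shared] == prev_dirs[shared]):
            shared += 1
        out.append("/".join([""] * shared + dirs[shared:] + [name]))
        prev_dirs = dirs
    return out


def expand(collapsed):
    """Inverse of collapse(): fill empty components from the previous path."""
    out = []
    prev_dirs = []
    for entry in collapsed:
        *dirs, name = entry.split("/")
        dirs = [d if d else prev_dirs[i] for i, d in enumerate(dirs)]
        out.append("/".join(dirs + [name]))
        prev_dirs = dirs
    return out


paths = [
    "include/asm-i386/mach-default/bios_ebda.h",
    "include/asm-i386/mach-default/do_timer.h",
]
short = collapse(paths)
print(short)

# The per-file SHA1s are untouched by this trick: 17,000 files * 20 bytes each.
sha_bytes = 17_000 * 20
print(sha_bytes)
```

The first entry stays in full here because nothing precedes it; in the mail's example both lines are collapsed because an earlier file in the same directory is implied by the "...". The last two lines restate the mail's objection: the incompressible SHA1s alone come to roughly the 350kB quoted, whatever is done to the path names.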
From: tony [email blocked]
Subject: Re: more git updates..
Date: Sat, 9 Apr 2005 14:00:09 -0700 (PDT)

>In other words, each "commit" file is very small and cheap, but since
>almost every commit will also imply a totally new tree-file, "git" is
>going to have an overhead of half a megabyte per commit. Oops.
>
>Damn, that's painful. I suspect I will have to change the format somehow.

Having dodged that bullet with the change to make tree files point at
other tree files ... here's another (potential) issue.

A changeset that touches just one file a few levels down from the top of
the tree (say arch/i386/kernel/setup.c) will make six new files in the git
repository (one for the changeset, four tree files, and a new blob for the
new version of the file). More complex changes make more files ... but say
the average is ten new files per changeset since most changes touch few
files.

With 60,000 changesets in the current tree, we will start out our git
repository with about 600,000 files. Assuming the first byte of the SHA1
hash is random, that means an average of 2343 files in each of the
objects/xx directories. Give it a few more years at the current pace, and
we'll have over 10,000 files per directory. This sounds like a lot to me
... but perhaps filesystems now handle large directories enough better
than they used to for this to not be a problem?

Or maybe the files should be named objects/xx/yy/zzzzzzzzzzzzzzzz?

-Tony


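
Tony's numbers can be reproduced with a short sketch of the directory fan-out. The `objects/xx` layout and the figures come from his mail; the helper name `object_path` and its `levels` parameter are assumptions made to compare the one-level layout with the two-level `objects/xx/yy/...` naming he suggests.

```python
def object_path(sha_hex, levels=1):
    """Map a hex SHA1 to a path, using `levels` two-hex-digit directory levels."""
    parts = [sha_hex[2 * i:2 * i + 2] for i in range(levels)]
    return "/".join(["objects"] + parts + [sha_hex[2 * levels:]])


def files_per_directory(total_files, levels=1):
    """Average directory size if the leading hash bytes are uniformly random."""
    return total_files / (256 ** levels)


sha = "5ca21c9d808fa4bee1eb6948a59dfb9c7d73f36a"  # tree ID quoted earlier in the thread
print(object_path(sha))            # one level:  objects/xx/...
print(object_path(sha, levels=2))  # two levels: objects/xx/yy/...

total = 60_000 * 10  # 60,000 changesets at roughly ten new files each
print(int(files_per_directory(total)), files_per_directory(total, levels=2))
```

With one level, 600,000 objects average about 2343 per directory, matching the mail; a second level would spread them over 65,536 directories at under ten each.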
From: Linus Torvalds [email blocked]
Subject: Re: more git updates..
Date: Sat, 9 Apr 2005 14:08:19 -0700 (PDT)

On Sat, 9 Apr 2005, Linus Torvalds wrote:
> 
> I suspect that I have to change the file format. Maybe make the "tree"
> object a two-level thing, and have a "directory" object.
> 
> Then a "tree" object would point to a "directory" object, which would in
> turn point to the individual files (and other "directory" objects, of
> course). That way a commit that only changes a few files will only need to
> create a few new "directory" objects, instead of creating one huge "tree"
> object.

Actually, I guess I wouldn't have to change the format. I could just
extend the existing "tree" object to be able to point to other trees, and
that's it.

The downside of that is that then a tree wouldn't have a canonical format
any more: you could have two trees that have the exact same content, but
they'd have different names. They should obviously merge very easily (and
thus you could create a new merge that _does_ have a common name), but
it's ugly.

I'll have to think about it. It's good to notice these issues early, this
was the first time I had actually tried to check in a kernel-sized tree
for real.

		Linus


From: Linus Torvalds [email blocked]
Subject: Re: more git updates..
Date: Sat, 9 Apr 2005 16:31:10 -0700 (PDT)

On Sat, 9 Apr 2005, Linus Torvalds wrote:
> 
> Actually, I guess I wouldn't have to change the format. I could just
> extend the existing "tree" object to be able to point to other trees, and
> that's it.

Done, and pushed out. The current git.git repository seems to do all of
this correctly.

NOTE! This means that each "tree" file basically tracks just a single
directory. The old style of "every file in one tree file" still works, but
fsck-cache will warn about it. Happily, the git archive itself doesn't
have any subdirectories, so git itself is not impacted by it.

Now, this means that I should add a "recursive" option to "tree-diff", but
I haven't done so yet. So right now if I change the top-level Makefile,
_and_ change kernel/exit.c, then the "tree diff" between the two commit
trees ends up looking like:

	torvalds@ppc970:~/lx-test/linux-2.6.12-rc2> diff-tree \
		7bec1223736d7e02c755e9a365984b3cbfa1e6e9 d64817f809a60cd960d3078ae91b4d19cb649501 \
		| tr '\0' '\n'
	<100644 e1e7f7430c0297f22042cff58da5ca73ef121b95 Makefile
	>100644 8ee21134577e98fb642dffc5b797a0121645c543 Makefile
	<40000 2239383d00ae746f5e79ceccf8ac3fbca62f949d kernel
	>40000 a8fad219cb78a6b6a05a10f8643d615fefc8160f kernel

ie it shows that the Makefile blob has changed, and the kernel directory
has changed. You then need to recurse into the kernel tree to see what the
changes were there:

	torvalds@ppc970:~/lx-test/linux-2.6.12-rc2> diff-tree \
		2239383d00ae746f5e79ceccf8ac3fbca62f949d a8fad219cb78a6b6a05a10f8643d615fefc8160f \
		| tr '\0' '\n'
	<100644 1a50b58453679b6fee8de4f744f4befc39397bb1 exit.c
	>100644 e8df1325bf25816827a1a64404ad533a97bfdae2 exit.c

but it clearly all seems to work. And it means that a subdirectory that
didn't change at all (the common case) will be able to re-use the old sha1
file when you create a tree (this may in fact make "diff-tree" much less
important, since now it tends to handle objects that are just a few kB in
size, rather than almost a megabyte).

So in this case, the "commit cost" of changing two files was two small
tree files (1468 and 679 bytes respectively for the kernel/ and top-level
directory) and the commit file itself (251 bytes). In addition to the
actual data files that were changed, of course.

Goodie. Big difference between that and the 460kB of the old monolithic
tree file.

		Linus


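
The effect of one tree object per directory can be shown with a toy model. It is a sketch under stated assumptions, not git's on-disk format: trees are hashed from a plain-text listing, the object store is an in-memory dict, and the names `nest`, `write_tree` and `changed_paths` are invented. The modes "40000" and "100644" follow the output quoted above.

```python
import hashlib


def blob_id(data):
    """Stand-in for a blob's object name (toy: SHA1 of the raw bytes)."""
    return hashlib.sha1(data).hexdigest()


def nest(files):
    """Turn {"kernel/exit.c": sha, ...} into nested dicts, one per directory."""
    root = {}
    for path, sha in files.items():
        *dirs, name = path.split("/")
        node = root
        for d in dirs:
            node = node.setdefault(d, {})
        node[name] = sha
    return root


def write_tree(node, store):
    """Write one tree object per directory into store; return the tree's name."""
    entries = []
    for name in sorted(node):
        child = node[name]
        if isinstance(child, dict):
            entries.append(("40000", write_tree(child, store), name))
        else:
            entries.append(("100644", child, name))
    body = "".join(f"{m} {s} {n}\n" for m, s, n in entries).encode()
    sha = hashlib.sha1(body).hexdigest()
    store[sha] = entries
    return sha


def changed_paths(a, b, store, prefix=""):
    """Recursive tree diff: subtrees with identical names are never opened."""
    if a == b:
        return []
    ea = {n: (m, s) for m, s, n in store[a]}
    eb = {n: (m, s) for m, s, n in store[b]}
    out = []
    for name in sorted(set(ea) | set(eb)):
        x, y = ea.get(name), eb.get(name)
        if x == y:
            continue
        if x and y and x[0] == y[0] == "40000":
            out += changed_paths(x[1], y[1], store, prefix + name + "/")
        else:
            out.append(prefix + name)
    return out


store = {}
files = {"Makefile": blob_id(b"v1"), "kernel/exit.c": blob_id(b"e1"),
         "kernel/fork.c": blob_id(b"f1"), "mm/mmap.c": blob_id(b"m1")}
old_root = write_tree(nest(files), store)
before = set(store)

files["Makefile"] = blob_id(b"v2")
files["kernel/exit.c"] = blob_id(b"e2")
new_root = write_tree(nest(files), store)
new_trees = set(store) - before
print(len(new_trees), changed_paths(old_root, new_root, store))
```

Changing the Makefile and kernel/exit.c creates only two new tree objects (the top level and kernel/); the mm/ tree keeps its old name and is reused, which mirrors the two small tree files reported in the mail.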

From: Petr Baudis [email blocked]
Subject: Re: Re: more git updates..
Date: Sun, 10 Apr 2005 04:41:57 +0200

Dear diary, on Sun, Apr 10, 2005 at 01:31:10AM CEST, I got a letter where Linus Torvalds [email blocked] told me that...
> On Sat, 9 Apr 2005, Linus Torvalds wrote:
> >
> > Actually, I guess I wouldn't have to change the format. I could just
> > extend the existing "tree" object to be able to point to other trees, and
> > that's it.
>
> Done, and pushed out. The current git.git repository seems to do all of
> this correctly.
..snip..

Ok, so now I can dare announce it, I hope. I hacked my branch of git somewhat, kept in sync with Linus, and now I have something to show. Please see it at

    http://pasky.or.cz/~pasky/dev/git/ [5]

It is basically a set of (still rather crude) shell scripts upon Linus' git, which make it sanely usable by mere humans for actual version tracking. Its usage _is_ going to change, so don't get too used to it (that'd be hard anyway, I suspect), but it should be working nicely.

I have described most of the interesting parts and some basic usage in the README at that page. It wraps commits, supports log retrieval and comfortable diffing between any two trees. And on top of that, it can do some basic remote repositories - it will pull (rsync) from them and it can make the local copy track them - on pull, it will be updated accordingly (and your local commits on the tracked branch will get orphaned).

I didn't attach a patch against Linus since I think it's pretty much useless now. It's available as against-linus.patch on the web, and you can apply it to the latest git tree (NOT 0.03). But it's probably a better idea to wget my tree. You can then watch us making progress by

    gitpull.sh linus
    gitpull.sh pasky

and see where we differ by:

    gitdiff.sh linus pasky

(This is how the against-linus.patch was generated. I'd easily generate even 0.03 patch this way, but I forgot to merge the fsck at that time, so it would suck.)

(Note that the tree you wget is set up to track my branch. If you want to stop tracking it (basically necessary now if you want to do local commits), do:

    cp .dircache/HEAD .dircache/HEAD.local
    gittrack.sh

The cp says something like "I want to pick up where the tracked branch left off". Otherwise, untracking would return you to your "local" branch, which is just some ancient predecessor of the pasky branch here anyway.)

Note that I didn't really test it on anything but git itself yet, so I'm not sure how it will cope especially with directories - I tried to make it aware of them though. I will do some more practical testing tomorrow.

Otherwise, I will probably try to consolidate the usage and documentation now, and beautify the scripts. I might start pondering some merging too. Oh, and gitpatch.sh. :-)

Have fun and please share your opinions,

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/ [6]
98% of the time I am right. Why worry about the other 3%.

From: Petr Baudis [email blocked]
To: Kernel Mailing List [email blocked]
Subject: [ANNOUNCE] git-pasky-0.1
Date: Sun, 10 Apr 2005 18:27:23 +0200

Hello,

so I "released" git-pasky-0.1, my set of patches and scripts upon Linus' git, aimed at human usability and to an extent a SCM-like usage. You can get it at

    http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2 [7]

and after unpacking and building (make) do

    git pull pasky

to get the latest changes from my branch. If you already have some git from my branch which can do pulling, you can bring yourself up to date by doing just

    gitpull.sh pasky

(but this style of usage is deprecated now).

Please see the README for some details regarding usage etc. You can find the changes from the last announcement in the ChangeLog (the previous announcement corresponds to commit id 5125d089ad862f16a306b4942155092e1dce1c2d). The most important change is probably recursive diff addition, and making git ignore the nsec of ctime and mtime, since it is totally unreliable and likes to taint random files as modified.

My near future plans include especially some merge support; I think it should be rather easy, actually. I'll also add some simple tagging mechanism. I've decided to postpone the file moving detection, since there's no big demand for it now. ;-)

I will also need to do more testing on the linux kernel tree. Committing patch-2.6.7 on 2.6.6 kernel and then diffing results in

    $ time gitdiff.sh `parent-id` `tree-id` >p
    real    5m37.434s
    user    1m27.113s
    sys     2m41.036s

which is pretty horrible, it seems to me. Any benchmarking help is of course welcomed, as well as any other feedback.

BTW, what would be the best (most complete) source for the BK tree metadata? Should I dig it from the BKCVS gateway, or is there a better source? Where did you get the sparse git database from, Linus? (BTW, it would be nice to get sparse.git with the directories as separate.)

Have fun,

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/ [8]
98% of the time I am right. Why worry about the other 3%.

From: Linus Torvalds [email blocked]
Subject: Re: [ANNOUNCE] git-pasky-0.1
Date: Sun, 10 Apr 2005 09:55:17 -0700 (PDT)

On Sun, 10 Apr 2005, Petr Baudis wrote:
>
> Where did you get the sparse git database from, Linus? (BTW, it
> would be nice to get sparse.git with the directories as separate.)

When we were trying to figure out how to avert the BK disaster, one of Tridge's concerns (and, in my opinion, the only really valid one) was that you couldn't get the BK data in some SCM-independent way.

So I wrote some very preliminary scripts (on top of BK itself) to extract the data, to show that BK could generate a SCM-neutral file format (a very stupid one and horribly useless for anything but interoperability, but still...). I was hoping that that would convince Tridge that trying to muck around with the internal BK file format was not worth it, and avert the BK trainwreck.

Larry was ok with the idea to make my export format actually be natively supported by BK (ie the same way you have "bk export -tpatch"), but Tridge wanted to instead get at the native data and be difficult about it. As a result, not only can I not use BK any more, but we also don't have a nice export format from BK.

Yeah, I'm a bit bitter about it.

Anyway, the sparse data came out of my hack. It's very inefficient, and I estimated that doing the same for the kernel would have taken ten solid days of conversion, mainly because my hack was really just that: a quick hack to show that BK could do it. Larry could have done it a lot better.

I'll re-generate the sparse git-database at some point (and I'll probably do so from the old GIT database itself, rather than re-generating it from my old BK data).

		Linus

From: Linus Torvalds [email blocked]
Subject: Re: more git updates..
Date: Sun, 10 Apr 2005 08:44:56 -0700 (PDT)

On Sun, 10 Apr 2005, Junio C Hamano wrote:
>
> But I am wondering what your plans are to handle renames---or
> does git already represent them?

You can represent renames on top of git - git itself really doesn't care. In many ways you can just see git as a filesystem - it's content-addressable, and it has a notion of versioning, but I really really designed it coming at the problem from the viewpoint of a _filesystem_ person (hey, kernels is what I do), and I actually have absolutely _zero_ interest in creating a traditional SCM system.

So to take renaming a file as an example - why do you actually want to track renames? In traditional SCM's, you do it for two reasons:

 - space efficiency. Most SCM's are based on describing changes to a file, and compress the data by doing revisions on the same file. In order to continue that process past a rename, such an SCM _has_ to track renames, or lose the delta-based approach.

   The most trivial example of this is "diff", ie a rename ends up generating a _huge_ diff unless you track the rename explicitly.

   GIT doesn't care. There is _zero_ space efficiency in trying to track renames. In fact, it would add overhead to the system, not lessen it. That's because GIT fundamentally doesn't do the "delta-within-a-file" model.

 - annotate/blame. This is a valid concern, but the fact is, I never use it. It may be a deficiency of mine, but I simply don't do the per-line thing when I debug or try to find who was responsible. I do "blame" on a much bigger-picture level, and I personally believe (pretty strongly) that per-line annotations are not actually a good thing - they come not because people _want_ to do things at that low level, but because historically, you didn't _have_ the bigger-picture thing.

   In other words, pretty much every SCM out there is based on SCCS "mentally", even if not in any other model. That's why people think per-line blame is important - you have that mental model.

So consider me deficient, or consider me radical. It boils down to the same thing. Renames don't matter.

That said, if somebody wants to create a _real_ SCM (rather than my notion of a pure content tracker) on top of GIT, you probably could fairly easily do so by imposing a few limitations on a higher level. For example, most SCM's that track renames require that the user _tell_ them about the renames: you do a "bk mv" or a "svn rename" or something.

If you want to do the same on top of GIT, then you should think of GIT as what it is: GIT just tracks contents. It's a filesystem - although a fairly strange one. How would you track renames on top of that? Easy: add your own fields to the GIT revision messages: GIT enforces the header, but you can add anything you want to the "free-form" part that follows it.

Same goes for any other information where you care about what happens "within" a file. GIT simply doesn't track it. You can build things on top of GIT if you want to, though. They may not be as efficient as they would be if they were built _into_ GIT, but on the other hand GIT does a lot of other things a hell of a lot faster thanks to its design.

So whether you agree with the things that _I_ consider important probably depends on how you work. The real downside of GIT may be that _my_ way of doing things is quite possibly very rare.

But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong.

		Linus
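
The suggestion above, recording renames in the free-form part of a revision message, could look roughly like the Python sketch below. Everything in it is hypothetical: the `Rename: old => new` line format, the helper names `make_commit` and `read_renames`, and the simplified `tree`/`parent` header are assumptions for illustration, not something git defines. The point is only that a tool layered on top can store and recover this information without the content tracker knowing about it.

```python
def make_commit(tree, parents, message, renames=()):
    """Build revision text: a fixed header, a blank line, then free-form text.

    The header layout here is a simplified assumption; the "Rename:" lines are
    a made-up convention living entirely in the free-form part.
    """
    header = [f"tree {tree}"] + [f"parent {p}" for p in parents]
    body = message.rstrip("\n")
    notes = [f"Rename: {old} => {new}" for old, new in renames]
    if notes:
        body += "\n\n" + "\n".join(notes)
    return "\n".join(header) + "\n\n" + body + "\n"


def read_renames(commit_text):
    """Recover the (old, new) pairs a higher-level tool stored in the message."""
    _header, _sep, body = commit_text.partition("\n\n")
    renames = []
    for line in body.splitlines():
        if line.startswith("Rename: ") and " => " in line:
            old, new = line[len("Rename: "):].split(" => ", 1)
            renames.append((old, new))
    return renames
```

A wrapper's own "mv" command would call something like `make_commit(..., renames=[("kernel/exit.c", "kernel/quit.c")])`, and its log or blame tooling would call `read_renames` when walking history.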

From: Paul Jackson [email blocked]
Subject: Re: more git updates..
Date: Sun, 10 Apr 2005 11:50:55 -0700

Linus wrote:
> It's a filesystem - although a
> fairly strange one.

Ah ha - that explains the read-tree and write-tree names.

The read-tree pulls stuff out of this file system into your working files, clobbering local edits. This is like the read(2) system call, which clobbers stuff in your read buffer.

The write-tree pushes stuff down into the file system, just like write(2) pushes data into the kernel.

I was getting all kind of frustrated yesterday trying to use Linus's git commands, coming at these names with my SCM hat on.

That way of thinking really doesn't work well here.

I will have to look more closely at pasky's GIT toolkit if I want to see an SCM style interface.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson [email blocked] 1.650.933.1373, 1.925.600.0401

From: Linus Torvalds [email blocked]
Subject: Re: more git updates..
Date: Sun, 10 Apr 2005 13:57:33 -0700 (PDT)

On Sun, 10 Apr 2005, Paul Jackson wrote:
>
> Ah ha - that explains the read-tree and write-tree names.
>
> The read-tree pulls stuff out of this file system into
> your working files, clobbering local edits. This is like
> the read(2) system call, which clobbers stuff in your
> read buffer.

Yes. Except it's a two-stage thing, where the staging area is always the "current directory cache".

So a "read-tree" always reads the tree information into the directory cache, but does not actually _update_ any of the files it "caches". To do that, you need to do a "checkout-cache" phase.

Similarly, "write-tree" writes the current directory cache contents into a set of tree files. But in order to have that match what is actually in your directory right now, you need to have done a "update-cache" phase before you did the "write-tree".

So there is always a staging area between the "real contents" and the "written tree".

> That way of thinking really doesn't work well here.
>
> I will have to look more closely at pasky's GIT toolkit
> if I want to see an SCM style interface.

Yes. You really should think of GIT as a filesystem, and of me as a _systems_ person, not an SCM person. In fact, I tend to detest SCM's. I think the reason I worked so well with BitKeeper is that Larry used to do operating systems. He's also a systems person, not really an SCM person. Or at least he's in between the two.

My operations are like the "system calls". Useless on their own: they're not real applications, they're just how you read and write files in this really strange filesystem. You need to wrap them up to make them do anything sane.

For example, take "commit-tree" - it really just says that "this is the new tree, and these other trees were its parents". It doesn't do any of the actual work to _get_ those trees written.

So to actually do the high-level operation of a real commit, you need to first update the current directory cache to match what you want to commit (the "update-cache" phase).

Then, when your directory cache matches what you want to commit (which is NOT necessarily the same thing as your actual current working area - if you don't want to commit some of the changes you have in your tree, you should avoid updating the cache with those changes), you do stage 2, ie "write-tree". That writes a tree node that describes what you want to commit.

Only THEN, as phase three, do you do the "commit-tree". Now you give it the tree you want to commit (remember - that may not even match your current directory contents), and the history of how you got here (ie you tell commit what the previous commit(s) were), and the changelog.

So a "commit" in SCM-speak is actually three totally separate phases in my filesystem thing, and each of the phases (except for the last "commit-tree" which is the thing that brings it all together) is actually in turn many smaller parts (ie "update-cache" may have been called hundreds of times, and "write-tree" will write several tree objects that point to each other).

Similarly, a "checkout" really is about first finding the tree ID you want to check out, and then bringing it into the "directory cache" by doing a "read-tree" on it. You can then actually update the directory cache further: you might "read-tree" _another_ project, or you could decide that you want to keep one of the files you already had.

So in that scenario, after doing the read-tree you'd do an "update-cache" on the file you want to keep in your current directory structure, which updates your directory cache to be a _mix_ of the original tree you now want to check out _and_ of the file you want to use from your current directory. Then doing a "checkout-cache -a" will actually do the actual checkout, and only at that point does your working directory really get changed.

Btw, you don't even have to have any working directory files at all. Let's say that you have two independent trees, and you want to create a new commit that is the join of those two trees (where one of the trees takes precedence). You'd do a "read-tree <a> <b>", which will create a directory cache (but not check out) that is the union of the <a> and <b> trees (<b> will override). And then you can do a "write-tree" and commit the resulting tree - without ever having _any_ of those files checked out.

		Linus
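
The three commit phases and the role of the directory cache as a staging area can be modelled in a few lines of Python. This is a toy sketch under stated assumptions, not how git is implemented: it uses one flat tree for all paths instead of one tree per directory, skips compression, and the function names merely echo the commands named in the message above (`demo` and `store` are invented helpers).

```python
import hashlib

objects = {}  # toy object database: name -> bytes
cache = {}    # toy "current directory cache": path -> blob name


def store(data):
    """Save bytes under the SHA1 of their contents and return that name."""
    name = hashlib.sha1(data).hexdigest()
    objects[name] = data
    return name


def update_cache(path, contents):
    """Phase 1: stage one file's contents in the cache."""
    cache[path] = store(b"blob\0" + contents)


def write_tree():
    """Phase 2: write what is in the cache (not the working files) as a tree."""
    listing = "\n".join(f"{path} {blob}" for path, blob in sorted(cache.items()))
    return store(b"tree\0" + listing.encode())


def commit_tree(tree, parents, changelog):
    """Phase 3: tie a written tree to its parent commits and a changelog."""
    lines = [f"tree {tree}"] + [f"parent {p}" for p in parents] + ["", changelog]
    return store(b"commit\0" + "\n".join(lines).encode())


def read_tree(*trees):
    """Load trees into the cache only; entries from later trees override."""
    cache.clear()
    for tree in trees:
        listing = objects[tree].split(b"\0", 1)[1].decode()
        for line in listing.splitlines():
            path, blob = line.rsplit(" ", 1)
            cache[path] = blob


def demo():
    """Commit one staged file, then join two trees with no files checked out."""
    update_cache("Makefile", b"v2\n")  # an edited kernel/exit.c is NOT staged
    tree_a = write_tree()
    commit = commit_tree(tree_a, [], "Update Makefile")
    cache.clear()
    update_cache("Makefile", b"other\n")
    update_cache("README", b"docs\n")
    tree_b = write_tree()
    read_tree(tree_a, tree_b)  # union of both trees, tree_b wins
    return commit, tree_a, dict(cache)
```

Only what was passed to `update_cache` ends up in the written tree, which is the sense in which a commit need not match the working directory. The final `read_tree(tree_a, tree_b)` builds the joined cache purely from stored objects.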

From: Linus Torvalds [email blocked]
To: Jeff Garzik [email blocked]
Subject: Re: New SCM and commit list
Date: Sun, 10 Apr 2005 23:15:20 -0700 (PDT)

On Mon, 11 Apr 2005, Jeff Garzik wrote:
>
> > But I hope that I can get non-conflicting merges done fairly soon, and
> > maybe I can con James or Jeff or somebody to try out GIT then...
>
> I don't mind being a guinea pig as long as someone else does the hard
> work of finding a new way to merge :)

So I can tell you what merges are going to be like, just to prepare you.

First, the good news: I think we can make the workflow look like bk, ie pretty much like "git pull" and "git push". And for well-behaved stuff (ie minimal changes to the same files on both sides) it will even be fast. I think.

Then the bad news: the merge algorithm is going to suck. It's going to be just plain 3-way merge, the same RCS/CVS thing you've seen before. With no understanding of renames etc. I'll try to find the best parent to base the merge off of, although early testers may have to tell the piece of crud what the most recent common parent was.

So anything that got modified in just one tree obviously merges to that version. Any file that got modified in two trees will end up just being passed to the "merge" program. See "man merge" and "man diff3". The merger gets to fix up any conflicts by hand.

Quite frankly, that means that we really want to avoid any "exciting" merges with GIT. Maybe somebody can come up with something smarter. Eventually. Don't count on it, at least not in the near future.

The good news is that it's not like a three-way file merge is any worse than many people are used to. The bad news is that BK is just a hell of a lot better.

So anybody who has been depending heavily on BK merges (and hey, the beauty of them is that you often don't even _know_ that you are depending on them) will be a bit bummed by the "Welcome back to the 1980's" message from a three-way merge.

		Linus
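
The tree-level rule in the message above (a file changed on one side takes that side's version, a file changed on both sides goes to a file-level merge) can be sketched in Python. The merge code did not exist yet when this was written, so this only illustrates the rule as described: the function name `merge_trees`, the path-to-blob-name dictionaries and the handling of deleted files are assumptions, and the sketch ignores renames exactly as the message says the real thing will.

```python
def merge_trees(base, ours, theirs):
    """Three-way merge at the tree level.

    Each argument maps path -> blob name; a missing path means "no such file".
    Returns (merged, conflicts), where `conflicts` lists the paths that both
    sides changed differently and that would be handed to a file-level
    three-way merge tool such as merge(1).
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (including both unchanged)
            result = o
        elif o == b:      # only their side changed it
            result = t
        elif t == b:      # only our side changed it
            result = o
        else:             # changed differently on both sides
            conflicts.append(path)
            continue
        if result is not None:  # None means the file was deleted
            merged[path] = result
    return merged, conflicts
```

Because objects are named by content, deciding whether a side "changed" a file is a plain comparison of names against the common parent, which is part of why the well-behaved case can be fast.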

From: Ingo Molnar [9] [email blocked]
Subject: Re: New SCM and commit list
Date: Mon, 11 Apr 2005 09:38:44 +0200

* Linus Torvalds [email blocked] wrote:

> Then the bad news: the merge algorithm is going to suck. It's going to
> be just plain 3-way merge, the same RCS/CVS thing you've seen before.
> With no understanding of renames etc. I'll try to find the best parent
> to base the merge off of, although early testers may have to tell the
> piece of crud what the most recent common parent was.
>
> So anything that got modified in just one tree obviously merges to
> that version. Any file that got modified in two trees will end up just
> being passed to the "merge" program. See "man merge" and "man diff3".
> The merger gets to fix up any conflicts by hand.

at that point Chris Mason's "rej" tool is pretty nifty:

    ftp://ftp.suse.com/pub/people/mason/rej/rej-0.13.tar.gz [10]

it gets the trivial rejects right, and is pretty powerful to quickly cycle through the nontrivial ones too. It shows the old and new code side by side too, etc.

(There is no fully automatic mode in which it would not bother the user with the really trivial rejects - but it has an automatic mode where you basically have to do nothing - maybe a fully automatic one could be added that would resolve low-risk rejects?)

it's really easy to use (but then again i'm a vim user, so i'm biased), just try it on a random .rej file you have ("rej -a kernel/sched.c.rej" or whatever).

	Ingo



Source URL:
http://kerneltrap.org/node/4982