Linux: Importing The Kernel Into git, Merging

Submitted by Jeremy
on April 18, 2005 - 5:22am

Over the weekend, Thomas Gleixner and Ingo Molnar both managed to separately import the complete kernel history into git format. Thomas used BitKeeper to export and create his tree, while Ingo used only the CVS tree as a source. Thomas noted that the archive, which contains three years of development history, comprises roughly 500,000 objects using about 3.2 GiB of disk space. Linus commented that he was very happy with this, "mainly because it seems to match my estimates to a tee. Which means that I just feel that much more confident about git actually being able to handle the kernel long-term, and not just as a stop-gap measure." In another email, he went on to note, "the roughly 10x expansion of archive size going from BK to git ends up in a similar 10x bandwidth expansion, in addition to just the overhead of reading tons of directory entries and comparing them (which is what both a wget and rsync thing ends up doing). I'm sure we can bring that down with smarter synchronization tools, but I also suspect that's some way away."

To begin actual work, it was decided that there was no need to have all three years of kernel history in the git repository, since that would require each developer to rsync 3+ GB of data. Instead, Linus created a new repository, without history, from the current version of the kernel, noting, "I'm not going to announce it on linux-kernel yet, because I don't think it's useful to anybody but a git person anyway." He then successfully merged changes from ARM maintainer Russell King, with a commit comment stating, "first ever true git merge. Let's see if it actually works." The merge was a simple one as there were no file conflicts, but as a first step it helped to prove the concepts. Linus added, "it may not be pretty, but it seems to have worked fine!" The git mailing list continues to be very active as rapid development of git continues.


From: Thomas Gleixner [email blocked]
To:  git
Subject: BK -> git export done
Date: 	Sat, 16 Apr 2005 17:57:26 +0200

Hi folks,

I managed finally to export the complete kernel history into git format
The resulting number of objects is ~ 500000
The required disk space is ~ 3.2 GiB

We also tracked the blob/tree/commit references in a SQL database. We
will post a SQL dump when the database is in a bit better shape. This
should make history tracking quite simple.

I currently figure out a way to post the data. My poor DSL line is a bit
too slow :)

tglx


From: Thomas Gleixner [email blocked]
Subject: Full history
Date: Sun, 17 Apr 2005 01:52:56 +0200

Hi,

I can publish the stuff on Monday from a university nearby.

---

total blob objects   = 228384
total tree objects   = 172507
total commit objects =  55877

The "empty" changesets which are noting merges are omitted at the
moment. Is it of interest to include them ??

It might also be interesting to export/merge the various
subsystem/maintainer trees including 2.4 into this archive. This would
cover the complete history

Disk space according to # du -sh

blobs                   ~ 2GiB
tree and commit objects ~ 1.3GiB

I looked at the spread of the 450k+ objects over the 256 subdirectories
in my exported git repository:

total 456768
max per XX subdir = 1936
avg per XX subdir = 1784
min per XX subdir = 1646

tglx

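The figures in this mail can be cross-checked with a little shell arithmetic. The sketch below sums the three object counts and divides the total by the 256 fan-out subdirectories. It also shows how an object's subdirectory could be derived, assuming the subdirectory name is the first two hex digits of the object's SHA-1 name. The helper name `fanout_dir` is hypothetical, and the sample id is a tree id quoted later in this thread.

```shell
#!/bin/sh
# Cross-check the object counts reported in the mail.
blobs=228384
trees=172507
commits=55877

total=$((blobs + trees + commits))
avg=$((total / 256))        # 256 subdirectories, named 00..ff

echo "total objects  = $total"
echo "avg per subdir = $avg"

# Hypothetical helper: print the fan-out subdirectory of an object,
# assumed here to be the first two hex digits of its SHA-1 name.
fanout_dir() {
    printf '%.2s\n' "$1"
}

fanout_dir 7792a93eddb3f9b8e3115daab8adb3030f258ce6
```

The sum and the integer average agree with the "total" and "avg per XX subdir" lines in the mail, which suggests the objects are spread fairly evenly over the subdirectories.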
From: Ingo Molnar [email blocked]
Subject: full kernel history, in patchset format
Date: Sat, 16 Apr 2005 15:15:28 +0200

i've converted the Linux kernel CVS tree into 'flat patchset' format,
which gave a series of 28237 separate patches. (Each patch represents a
changeset, in the order they were applied. I've used the cvsps utility.)

the history data starts at 2.4.0 and ends at 2.6.12-rc2. I've included a
script that will apply all the patches in order and will create a
pristine 2.6.12-rc2 tree.

it needed many hours to finish, on a very fast server with tons of RAM,
and it also needed a fair amount of manual work to extract it and to
make it usable, so i guessed others might want to use the end result as
well, to try and generate large GIT repositories from them (or to run
analysis over the patches, etc.).

the patches contain all the existing metadata, dates, log messages and
revision history. (What i think is missing is the BK tree merge
information, but i'm not sure we want/need to convert them to GIT.)

it's a 136 MB tarball, which can be downloaded from:

  http://kernel.org/pub/linux/kernel/people/mingo/Linux-2.6-patchset/

the ./generate-2.6.12-rc2 script generates the 2.6.12-rc2 tree into
linux/, from scratch. (No pre-existing kernel is needed, as 2.patch
generates the full 2.4.0 kernel tree.) The patching takes a couple of
minutes to finish, on a fast box.

below i've attached a sample patch from the series.

note: i kept the patches the cvsps utility generated as-is, to have a
verifiable base to work on. There were a very small amount of deltas
missed (about a dozen), probably resulting from CVS related errors,
these are included in the diff-CVS-to-real patch. Also, the patch format
cannot create the Documentation/logo.gif file, so the script does this
too - just to be able to generate a complete 2.6.12-rc2 tree that is
byte-for-byte identical to the real thing.

	Ingo

---------------------
PatchSet 1234
Date: 2002/04/11 18:29:07
Author: viro
Branch: HEAD
Tag: (none)
Log:
[PATCH] crapectomy in include/linux/nfsd/syscall.h

Removes an atavism in declaration of sys_nfsservctl() - sorry, I should've
remove that junk when cond_syscall() thing was done.

BKrev: 3cb5c7e3phTYgiz1YLsjQ_McTo9pOQ

Members:
	ChangeSet:1.1234->1.1235
	include/linux/nfsd/syscall.h:1.3->1.4

Index: linux/include/linux/nfsd/syscall.h
===================================================================
RCS file: /home/mingo/linux-CVS/linux/include/linux/nfsd/syscall.h,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -r1.3 -r1.4
--- linux/include/linux/nfsd/syscall.h	15 Mar 2002 23:06:06 -0000	1.3
+++ linux/include/linux/nfsd/syscall.h	11 Apr 2002 17:29:07 -0000	1.4
@@ -132,11 +132,7 @@
 /*
  * Kernel syscall implementation.
  */
-#if defined(CONFIG_NFSD) || defined(CONFIG_NFSD_MODULE)
 extern asmlinkage long sys_nfsservctl(int, struct nfsctl_arg *, void *);
-#else
-#define sys_nfsservctl sys_ni_syscall
-#endif
 extern int exp_addclient(struct nfsctl_client *ncp);
 extern int exp_delclient(struct nfsctl_client *ncp);
 extern int exp_export(struct nfsctl_export *nxp);

From: Linus Torvalds [email blocked]
Subject: Re: full kernel history, in patchset format
Date: Sat, 16 Apr 2005 10:04:31 -0700 (PDT)

On Sat, 16 Apr 2005, Ingo Molnar wrote:
>
> i've converted the Linux kernel CVS tree into 'flat patchset' format,
> which gave a series of 28237 separate patches. (Each patch represents a
> changeset, in the order they were applied. I've used the cvsps utility.)
>
> the history data starts at 2.4.0 and ends at 2.6.12-rc2. I've included a
> script that will apply all the patches in order and will create a
> pristine 2.6.12-rc2 tree.

Hey, that's great. I got the CVS repo too, and I was looking at it, but
the more I looked at it, the more I felt that the main reason I want to
import it into git ends up being to validate that my size estimates are at
all realistic.

I see that Thomas Gleixner seems to have done that already, and come to a
figure of 3.2GB for the last three years, which I'm very happy with,
mainly because it seems to match my estimates to a tee. Which means that I
just feel that much more confident about git actually being able to handle
the kernel long-term, and not just as a stop-gap measure.

But I wonder if we actually want to actually populate the whole history..
Now that my size estimates have been verified, I have little actual real
reason to put the history into git. There are no visualization tools done
for git yet, and no helpers to actually find problems, and by the time
there will be, we'll have new history.

So I'd _almost_ suggest just starting from a clean slate after all.
Keeping the old history around, of course, but not necessarily putting it
into git now. It would just force everybody who is getting used to git in
the first place to work with a 3GB archive from day one, rather than
getting into it a bit more gradually.

What do people think? I'm not so much worried about the data itself: the
git architecture is _so_ damn simple that now that the size estimate has
been confirmed, that I don't think it would be a problem per se to put
3.2GB into the archive. But it will bog down "rsync" horribly, so it will
actually hurt synchronization until somebody writes the rev-tree-like
stuff to communicate changes more efficiently..

IOW, it smells to me like we don't have the infrastructure to really work
with 3GB archives, and that if we start from scratch (2.6.12-rc2), we can
build up the infrastructure in parallel with starting to really need it.

But it's _great_ to have the history in this format, especially since
looking at CVS just reminded me how much I hated it.

Comments?

		Linus

From: Ingo Molnar [email blocked]
Subject: Re: full kernel history, in patchset format
Date: Sat, 16 Apr 2005 21:41:26 +0200

* Linus Torvalds [email blocked] wrote:

> > the history data starts at 2.4.0 and ends at 2.6.12-rc2. I've included a
> > script that will apply all the patches in order and will create a
> > pristine 2.6.12-rc2 tree.
>
> Hey, that's great. I got the CVS repo too, and I was looking at it,
> but the more I looked at it, the more I felt that the main reason I
> want to import it into git ends up being to validate that my size
> estimates are at all realistic.
>
> I see that Thomas Gleixner seems to have done that already, and come
> to a figure of 3.2GB for the last three years, which I'm very happy
> with, mainly because it seems to match my estimates to a tee. [...]

(yeah, we apparently worked in parallel - i only learned about his
efforts after i sent my mail. He was using BK to extract info, i was
using the CVS tree alone and no BK code whatsoever. (I dont think there
will be any argument about who owns what, but i wanted to be on the safe
side, and i also wanted to see how complete and usable the CVS metadata
is - it's close to perfect i'd say, for the purposes i care about.))

> But I wonder if we actually want to actually populate the whole
> history..

yeah, it definitely feels a bit brave to import 28,000 changesets into a
source-code database project that will be a whopping 2 weeks old in 2
days ;) Even if we felt 100% confident about all the basics (which we do
of course ;), it's just simply too young to tie things down via a 3.2GB
database. It feels much more natural to grow it gradually, 28,000
changesets i'm afraid would just suffocate the 'project growth
dynamics'. Not going too fast is just as important as not going too
slow.

I didnt generate the patchset to get it added into some central
repository right now, i generated it to check that we _do_ have all the
revision history in an easy to understand format which does generate
today's kernel tree, so that we can lean back and worry about the full
database once things get a bit more settled down (in a couple of months
or so). It's also an easy testbed for GIT itself.

but the revision history was one of the main reasons i used BK myself,
so we'll need a merged database eventually. Occasionally i needed to
check who was the one who touched a particular piece of code - was that
fantastic new line of code written by me, or was that buggy piece of
crap written by someone else? ;) Also, looking at a change and then
going to the changeset that did it, and then looking at the full picture
was pretty useful too. So that sort of annotation, and generally
navigating around _quickly_ and looking at the 'flow' of changes going
into a particular file was really useful (for me).

	Ingo

From: Linus Torvalds [email blocked]
Subject: Re-done kernel archive - real one?
Date: Sat, 16 Apr 2005 16:01:45 -0700 (PDT)

Ok, nobody really objected to the notion of leaving the kernel history
behind for now, and in fact most people seemed to basically agree. So with
that decided, the old kernel testing tree was actually perfectly ok,
except it had been built up with the old-style commit date handling, which
made me not want to use it as a base for any real work.

So I re-created the dang thing (hey, it takes just a few minutes), and
pushed it out, and there's now an archive on kernel.org in my public
"personal" directory called "linux-2.6.git". I'll continue the tradition
of naming git-archive directories as "*.git", since that really ends up
being the ".git" directory for the checked-out thing.

I'm not going to announce it on linux-kernel yet, because I don't think
it's useful to anybody but a git person anyway. Besides, I don't actually
know how happy the kernel.org people are about this distribution method
and whether it ends up being a horrible disaster for the mirroring setup.
Peter made some noises about /pub/scm, which makes sense, and would be a
better place than my public tree. Apparently there are other places that
are willing and able to host things too, so we'll see.

NOTE! The roughly 10x expansion of archive size going from BK to git ends
up in a similar 10x bandwidth expansion, in addition to just the overhead
of reading tons of directory entries and comparing them (which is what
both a wget and rsync thing ends up doing). I'm sure we can bring that
down with smarter synchronization tools, but I also suspect that's some
way away.

So is real common usage, though, so maybe it's not that bad at all. Who
knows. We haven't hit a single real snag so far (except it took several
days longer than I expected, but hey, I expect lots of things ;), and I'm
sure real usage will show lots of them.

Similarly, we don't really have real merging, which makes tracking harder,
but I suspect actually having a tree out there will make people more
motivated and have more of a test-case. I'm feeling good enough about the
plumbing that I think I solved the "hard" part of it, and now it's just
the boring 95% left - scripting around it.

I think that with the new merge model, the easiest thing to do is to just
download all new objects, and then download the HEAD file under a new
name. Ie we have two phases to the merge: first get the objects, with
something like

	repo=kernel.org:/pub/kernel/people/torvalds/linux-2.6.git
	rsync --ignore-existing -acv ${repo}/ .git/

which will _not_ download the new HEAD file (since you already have one of
your own), and then when you actually decide to merge you do

	rsync -acv ${repo}/HEAD .git/MERGE_WITH

and now you can look at your old HEAD, and the MERGE_WITH thing, look up
the parents, and then do

	read-tree -m <parent-tree> <head-tree> <merge-with-tree>
	write-tree
	commit-tree <result-tree> -p <head-tree> -p <merge-with-tree>

(which should actually _work_, assuming that the merge had no file
conflicts).

This seems to be a sane way to do merges, and if the scripting starts from
there and then becomes smarter...

		Linus

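The two-phase procedure Linus outlines can be sketched as a small dry-run script. It only prints the commands it would run, so nothing is fetched or written. The function name `plan_merge` and its parameters are hypothetical, the variable is expanded as `${repo}` so that a POSIX shell substitutes it, and the plumbing commands are spelled as in the mail (later git releases expose them as `git read-tree`, `git write-tree` and `git commit-tree`). The last step passes the two commit ids to `-p`, as in the hand-run merge quoted later in this thread.

```shell
#!/bin/sh
# Dry-run sketch: print the two-phase merge steps instead of executing them.
# plan_merge and its parameters are illustrative names, not part of git.
plan_merge() {
    repo=$1          # remote archive, e.g. host:/path/linux-2.6.git
    base_tree=$2     # tree of the common ancestor commit
    head_tree=$3     # tree of our HEAD commit
    other_tree=$4    # tree of the commit we merge with

    # Phase 1: fetch objects we lack; --ignore-existing leaves our HEAD alone.
    echo "rsync --ignore-existing -acv ${repo}/ .git/"
    # Phase 2: fetch the remote HEAD under a different name.
    echo "rsync -acv ${repo}/HEAD .git/MERGE_WITH"
    # Three-way merge at the tree level, then record the result.
    echo "read-tree -m ${base_tree} ${head_tree} ${other_tree}"
    echo "write-tree"
    echo "commit-tree <result-tree> -p \$(cat .git/HEAD) -p \$(cat .git/MERGE_WITH)"
}

plan_merge kernel.org:/pub/kernel/people/torvalds/linux-2.6.git BASE OURS THEIRS
```

Printing the plan first makes it easy to inspect each step by hand before running anything, which matches the cautious, step-by-step way the first merge was carried out.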
From: Russell King [email blocked]
Subject: Re: Re-done kernel archive - real one?
Date: Sun, 17 Apr 2005 16:24:48 +0100

On Sat, Apr 16, 2005 at 04:01:45PM -0700, Linus Torvalds wrote:
> So I re-created the dang thing (hey, it takes just a few minutes), and
> pushed it out, and there's now an archive on kernel.org in my public
> "personal" directory called "linux-2.6.git". I'll continue the tradition
> of naming git-archive directories as "*.git", since that really ends up
> being the ".git" directory for the checked-out thing.

We need to work out how we're going to manage to get our git changes to
you. At the moment, I've very little idea how to do that. Ideas?

At the bottom is the script itself. There's probably some aspects of it
which aren't nice, maybe Petr can advise on this (and maybe increase the
functionality of the git shell script to fill in where necessary.)

However, I've made a start to generate the necessary emails. How about
this format?

I'm not keen on the tree, parent, author and committer objects appearing
in this - they appear to clutter it up. What're your thoughts?

I'd rather not have the FQDN of the machine where the commit happened
appearing in the logs. (I've 'xxxx'd it out for the time being, because
I'd rather not have yet more email-address-like objects get into spammers
databases with which to hammer my 512kbps DSL line.)

Linus,

Please incorporate the latest ARM changes.

This will update the following files:

 arm/kernel/process.c                |   15 +++++++++++----
 arm/kernel/traps.c                  |    8 ++------
 arm/lib/changebit.S                 |   11 ++---------
 arm/lib/clearbit.S                  |   13 ++-----------
 arm/lib/setbit.S                    |   11 ++---------
 arm/lib/testchangebit.S             |   15 ++-------------
 arm/lib/testclearbit.S              |   15 ++-------------
 arm/lib/testsetbit.S                |   15 ++-------------
 arm/mach-footbridge/dc21285-timer.c |    4 ++--
 arm/mach-sa1100/h3600.c             |    2 +-
 asm-arm/ptrace.h                    |    5 +----
 asm-arm/system.h                    |    3 +++
 12 files changed, 32 insertions(+), 85 deletions(-)

through these ChangeSets:

    tree 7c4d75539c29ef7a9dde81acf84a072649f4f394
    parent d5922e9c35d21f0b6b82d1fd8b1444cfce57ca34
    author Russell King [email blocked] 1113749462 +0100
    committer Russell King [email blocked] 1113749462 +0100

    [PATCH] ARM: bitops

    Convert ARM bitop assembly to a macro. All bitops follow the same
    format, so it's silly duplicating the code when only one or two
    instructions are different.

    Signed-off-by: Russell King [email blocked]
    tree fc10d3ffa6062cda10a10cb8262d8df238aea4fb
    parent 5d9a545981893629c8f95e2b8b50d15d18c6ddbc
    author Russell King [email blocked] 1113749436 +0100
    committer Russell King [email blocked] 1113749436 +0100

    [PATCH] ARM: showregs

    Fix show_regs() to provide a backtrace. Provide a new __show_regs()
    function which implements the common subset of show_regs() and die().
    Add prototypes to asm-arm/system.h

    Signed-off-by: Russell King [email blocked]
    tree 5591fced9a2b5f84c6772dcbe2eb4b24e29161fc
    parent 488faba31f59c5960aabbb2a5877a0f2923937a3
    author Russell King [email blocked] 1113748846 +0100
    committer Russell King [email blocked] 1113748846 +0100

    [PATCH] ARM: h3600_irda_set_speed arguments

    h3600_irda_set_speed() had the wrong type for the "speed" argument.
    Fix this.

    Signed-off-by: Russell King [email blocked]
    tree 2493491da6e446e48d5443f0a549a10ed3d35b62
    parent e7905b2f22eb5d5308c9122b9c06c2d02473dd4f
    author Russell King [email blocked] 1113748615 +0100
    committer Russell King [email blocked] 1113748615 +0100

    [PATCH] ARM: footbridge rtc init

    The footbridge ISA RTC was being initialised before we had setup the
    kernel timer. This caused a divide by zero error when the current
    time of day is set. Resolve this by initialising the RTC after the
    kernel timer has been initialised.

    Signed-off-by: Russell King [email blocked]

---
#!/bin/sh
prev=$(cat .git/heads/origin)
to=$(cat .git/HEAD)
who=Linus
what=ARM
cat << EOT
${who},

Please incorporate the latest ${what} changes.

This will update the following files:

EOT
git diff $prev $to | diffstat -p1
cat << EOT

through these ChangeSets:

EOT
this=$to
while [ "$this" != "$prev" ]; do
  cat-file commit $this | sed 's,.*,\t&,'
  this=$(cat-file commit $this | grep ^parent | cut -d ' ' -f 2)
done

--
Russell King

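The loop in Russell's script follows the `parent` line of each commit. A commit produced by a merge carries two `parent` lines, in which case `grep ^parent | cut` yields two ids and the loop variable no longer names a single commit. A minimal sketch of one way to follow only the first parent is shown below. The function name `first_parent` is hypothetical, and the sample text is the header of the merge commit quoted later in this thread, with the author and committer lines left out.

```shell
#!/bin/sh
# Hypothetical helper: read a commit object's text on stdin and print the
# id of its first parent only. It stops at the first blank line so that
# text in the commit message is never mistaken for a header.
first_parent() {
    sed -n -e '/^$/q' -e 's/^parent \([0-9a-f]*\).*/\1/p' | head -n 1
}

# Sample header of a merge commit (two parent lines).
sample='tree 7792a93eddb3f9b8e3115daab8adb3030f258ce6
parent 8173055926cdb8534fbaed517a792bd45aed8377
parent df4449813c900973841d0fa5a9e9bc7186956e1e

Merge with master.kernel.org:/home/rmk/linux-2.6-rmk.git - ARM changes'

printf '%s\n' "$sample" | first_parent
```

Following only the first parent keeps the walk linear, at the cost of skipping the commits that arrived through the second parent. Listing those as well would need a real graph traversal rather than a single loop variable.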
From: Linus Torvalds [email blocked]
Subject: Re: Re-done kernel archive - real one?
Date: Sun, 17 Apr 2005 09:36:09 -0700 (PDT)

On Sun, 17 Apr 2005, Russell King wrote:
>
> On Sat, Apr 16, 2005 at 04:01:45PM -0700, Linus Torvalds wrote:
> > So I re-created the dang thing (hey, it takes just a few minutes), and
> > pushed it out, and there's now an archive on kernel.org in my public
> > "personal" directory called "linux-2.6.git". I'll continue the tradition
> > of naming git-archive directories as "*.git", since that really ends up
> > being the ".git" directory for the checked-out thing.
>
> We need to work out how we're going to manage to get our git changes to
> you. At the moment, I've very little idea how to do that. Ideas?

To me, merging is my highest priority. I suspect that once I have a tree
from you (or anybody else) that I actually _test_ merging with, I'll be
motivated as hell to make sure that my plumbing actually works.

After all, it's not just you who want to have to avoid the pain of
merging: it's definitely in my own best interests to make merging as easy
as possible. You're _the_ most obvious initial candidate, because your
merges almost never have any conflicts at all, even on a file level (much
less within a file).

> However, I've made a start to generate the necessary emails. How about
> this format?
>
> I'm not keen on the tree, parent, author and committer objects appearing
> in this - they appear to clutter it up. What're your thoughts?

Indeed. I'd almost drop the whole header except for the "author" line.

Oh, and you need a separator between commits, right now your
"Signed-off-by:" line ends up butting up with the header of the next
commit ;)

> I'd rather not have the FQDN of the machine where the commit happened
> appearing in the logs.

That's fine. Our short-logs have always tried to have just the real name
in them, and I do want an email-like thing for tracking the developer, but
yes, if you remove the email, that's fine. It should be easy enough to do
with a simple

	sed 's/<.*>//'

or similar.

And if you replace "author" with "From:" and do the date conversion, it
might look more natural.

		Linus

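Linus's two suggestions, stripping the address and turning `author` into `From:`, can be combined in one `sed` call. The sketch below is an assumption about the desired output rather than the script Russell ended up with. The function name `short_author` and the address `rmk@example.org` are placeholders. The date conversion is left out because the way to format an epoch timestamp differs between `date` implementations.

```shell
#!/bin/sh
# Hypothetical helper: read a commit object's text on stdin and print a
# short "From:" line built from its "author" header, dropping the
# <address> and everything after it (timestamp and timezone).
short_author() {
    sed -n '/^author /{
        s/^author /From: /
        s/ *<.*$//
        p
    }'
}

# Sample header lines; the address is a placeholder.
printf '%s\n' \
    'author Russell King <rmk@example.org> 1113749462 +0100' \
    'committer Russell King <rmk@example.org> 1113749462 +0100' |
    short_author
```

Only the `author` line is selected, so the `committer` line and the rest of the header are dropped in the same pass.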
From: Russell King [email blocked]
Subject: Re: Re-done kernel archive - real one?
Date: Sun, 17 Apr 2005 19:57:42 +0100

On Sun, Apr 17, 2005 at 09:36:09AM -0700, Linus Torvalds wrote:
> On Sun, 17 Apr 2005, Russell King wrote:
> > On Sat, Apr 16, 2005 at 04:01:45PM -0700, Linus Torvalds wrote:
> > > So I re-created the dang thing (hey, it takes just a few minutes), and
> > > pushed it out, and there's now an archive on kernel.org in my public
> > > "personal" directory called "linux-2.6.git". I'll continue the tradition
> > > of naming git-archive directories as "*.git", since that really ends up
> > > being the ".git" directory for the checked-out thing.
> >
> > We need to work out how we're going to manage to get our git changes to
> > you. At the moment, I've very little idea how to do that. Ideas?
>
> To me, merging is my highest priority. I suspect that once I have a tree
> from you (or anybody else) that I actually _test_ merging with, I'll be
> motivated as hell to make sure that my plumbing actually works.

Ok, I'll throw this tree onto master.kernel.org - how about
master.kernel.org:/home/rmk/linux-2.6-rmk.git ?

I think it's in the same format as your trees:

linux-2.6-rmk.git
|-- HEAD
`-- objects

where HEAD was copied from my .git/heads/master, and objects from
.git/objects.

> > However, I've made a start to generate the necessary emails. How about
> > this format?
> >
> > I'm not keen on the tree, parent, author and committer objects appearing
> > in this - they appear to clutter it up. What're your thoughts?
>
> Indeed. I'd almost drop the whole header except for the "author" line.

Done.

> Oh, and you need a separator between commits, right now your
> "Signed-off-by:" line ends up butting up with the header of the next
> commit ;)

Done.

> > I'd rather not have the FQDN of the machine where the commit happened
> > appearing in the logs.
>
> That's fine. Out short-logs have always tried to have just the real name
> in them, and I do want an email-like thing for tracking the developer, but
> yes, if you remove the email, that's fine. It should be easy enough to do
> with a simple
>
> 	sed 's/<.*>//'
>
> or similar.

Done.

> And if you replace "author" with "From:" and do the date conversion, it
> might look more natural.

Also done. 8)

I still need to work out how to make my noddy script follow different
branches which may be present though. However, for my common work flow,
it fits what I require.

Ok, how about this format:

Linus,

Please incorporate the latest ARM changes.

This will update the following files:

 arch/arm/kernel/process.c                  |   15 +++++++++++----
 arch/arm/kernel/traps.c                    |    8 ++------
 arch/arm/lib/changebit.S                   |   11 ++---------
 arch/arm/lib/clearbit.S                    |   13 ++-----------
 arch/arm/lib/setbit.S                      |   11 ++---------
 arch/arm/lib/testchangebit.S               |   15 ++-------------
 arch/arm/lib/testclearbit.S                |   15 ++-------------
 arch/arm/lib/testsetbit.S                  |   15 ++-------------
 arch/arm/mach-footbridge/dc21285-timer.c   |    4 ++--
 arch/arm/mach-sa1100/h3600.c               |    2 +-
 include/asm-arm/arch-ebsa285/debug-macro.S |    7 +++++--
 include/asm-arm/arch-rpc/debug-macro.S     |    5 ++++-
 include/asm-arm/ptrace.h                   |    5 +----
 include/asm-arm/system.h                   |    3 +++
 14 files changed, 41 insertions(+), 88 deletions(-)

through these ChangeSets:

Author: Russell King: Sun Apr 17 16:28:31 BST 2005

    [PATCH] ARM: fix debug macros

    Fix debug EBSA285 and RiscPC debugging macros to detect whether the
    MMU is enabled.

    Signed-off-by: Russell King

Author: Russell King: Sun Apr 17 15:51:02 BST 2005

    [PATCH] ARM: bitops

    Convert ARM bitop assembly to a macro. All bitops follow the same
    format, so it's silly duplicating the code when only one or two
    instructions are different.

    Signed-off-by: Russell King

Author: Russell King: Sun Apr 17 15:50:36 BST 2005

    [PATCH] ARM: showregs

    Fix show_regs() to provide a backtrace. Provide a new __show_regs()
    function which implements the common subset of show_regs() and die().
    Add prototypes to asm-arm/system.h

    Signed-off-by: Russell King

Author: Russell King: Sun Apr 17 15:40:46 BST 2005

    [PATCH] ARM: h3600_irda_set_speed arguments

    h3600_irda_set_speed() had the wrong type for the "speed" argument.
    Fix this.

    Signed-off-by: Russell King

Author: Russell King: Sun Apr 17 15:36:55 BST 2005

    [PATCH] ARM: footbridge rtc init

    The footbridge ISA RTC was being initialised before we had setup the
    kernel timer. This caused a divide by zero error when the current
    time of day is set. Resolve this by initialising the RTC after the
    kernel timer has been initialised.

    Signed-off-by: Russell King

--
Russell King

From: David Woodhouse [email blocked]
Subject: Re: Re-done kernel archive - real one?
Date: Mon, 18 Apr 2005 07:52:15 +1000

On Sat, 2005-04-16 at 16:01 -0700, Linus Torvalds wrote:
> So I re-created the dang thing (hey, it takes just a few minutes), and
> pushed it out, and there's now an archive on kernel.org in my public
> "personal" directory called "linux-2.6.git". I'll continue the tradition
> of naming git-archive directories as "*.git", since that really ends up
> being the ".git" directory for the checked-out thing.

Do you want the commits list running for it yet? Do you want the
changesets which are already in it re-mailed without a 'TESTING' tag?

--
dwmw2

From: Linus Torvalds [email blocked]
Subject: Re: Re-done kernel archive - real one?
Date: Sun, 17 Apr 2005 15:17:50 -0700 (PDT)

On Mon, 18 Apr 2005, David Woodhouse wrote:
>
> Do you want the commits list running for it yet? Do you want the
> changesets which are already in it re-mailed without a 'TESTING' tag?

I really don't know. I'm actually very happy where this thing is right
now, and completing that first merge successfully was a big milestone to
me personally. That said, actually _using_ this thing is not for the
faint-of-heart, and while I think "git" already is showing itself to be
useful, I'm very very biased.

In other words, I really wonder what an outsider that doesn't have the
same kind of mental bias thinks of the current git tree. Is it useful, or
is it still just a toy for Linus to test out his crazy SCM-wannabe.

Can people usefully track my current kernel git repository, or do you have
to be crazy to do so? That's really the question. You be the judge. Me,
I'm just giddy from a merge that was clearly done using interfaces that
aren't actually really usable for anybody but me, and barely me at that ;)

		Linus

Btw, I also do want this to show up in the BK trees for people who use
BitKeeper - the same way we always supported tar-ball + patch users
before. So I'll have to try to come up with some sane way to do that too.
Any ideas? The first series of 198 patches is obvious enough and can be
just done that way directly, but the merge..

From: Linus Torvalds [email blocked]
Subject: First ever real kernel git merge!
Date: Sun, 17 Apr 2005 15:10:25 -0700 (PDT)

It may not be pretty, but it seems to have worked fine!

Here's my history log (with intermediate checking removed - I was being
pretty anal ;):

	rsync -avz --ignore-existing \
		master.kernel.org:/home/rmk/linux-2.6-rmk.git/ .git/
	rsync -avz --ignore-existing \
		master.kernel.org:/home/rmk/linux-2.6-rmk.git/HEAD .git/MERGE-HEAD
	merge-base $(cat .git/HEAD) $(cat .git/MERGE-HEAD)
	for i in e7905b2f22eb5d5308c9122b9c06c2d02473dd4f $(cat .git/HEAD) \
		$(cat .git/MERGE-HEAD); do cat-file commit $i | head -1; done
	read-tree -m cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6 \
		9c78e08d12ae8189f3bd5e03accc39e3f08e45c9 a43c4447b2edc9fb01a6369f10c1165de4494c88
	write-tree
	commit-tree 7792a93eddb3f9b8e3115daab8adb3030f258ce6 -p $(cat .git/HEAD) \
		-p $(cat .git/MERGE-HEAD)
	echo 5fa17ec1c56589476c7c6a2712b10c81b3d5f85a > .git/HEAD
	fsck-cache --unreachable 5fa17ec1c56589476c7c6a2712b10c81b3d5f85a

which looks really messy, because I really wanted to do each step slowly
by hand, so those magic revision numbers are just cut-and-pasted from the
results that all the previous stages had printed out.

NOTE! As expected, this merge had absolutely zero file-level clashes,
which is why I could just do the "read-tree -m" followed by a write-tree.
But it's a real merge: I had some extra commits in my tree that were not
in Russell's tree, and obviously vice versa.

Also note! The end result is not actually written back to the current
working directory, so to see what the merge result actually is, there's
another final phase:

	read-tree 7792a93eddb3f9b8e3115daab8adb3030f258ce6
	update-cache --refresh
	checkout-cache -f -a

which just updates the current working directory to the results. I'm _not_
caring about old dirty state for now - the theory was to get this thing
working first, and worry about making it nice to use later.

A second note: a real "merge" thing should notice that if the "merge-base"
output ends up being one of the inputs (if one side is a strict subset of
the other side), then the merge itself should never be done, and the
script should just update directly to which-ever is non-common HEAD.

But as far as I can tell, this really did work out correctly and 100%
according to plan. As a result, if you update to my current tree, the
top-of-tree commit should be:

	cat-file commit $(cat .git/HEAD)

	tree 7792a93eddb3f9b8e3115daab8adb3030f258ce6
	parent 8173055926cdb8534fbaed517a792bd45aed8377
	parent df4449813c900973841d0fa5a9e9bc7186956e1e
	author Linus Torvalds [email blocked] 1113774444 -0700
	committer Linus Torvalds [email blocked] 1113774444 -0700

	Merge with master.kernel.org:/home/rmk/linux-2.6-rmk.git - ARM changes

	First ever true git merge. Let's see if it actually works.

Yehaa! It did take basically zero time, btw. Except for my bumbling about,
and the first "rsync the objects from rmk's directory" part (which wasn't
horrible, it just wasn't instantaneous like the other phases).

Btw, to see the output, you really want to have a "git log" that sorts by
date. I had an old "gitlog.sh" that did the old recursive thing, and while
it shows the right thing, the ordering ended up making it be very
non-obvious that rmk's changes had been added recently, since they ended
up being at the very bottom.

		Linus

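The shortcut described in Linus's second note can be written as a small decision function. This is only a sketch of the rule as stated in the mail: the name `merge_action` and the three result words are made up for illustration, and the ids are expected to come from a tool such as `merge-base`.

```shell
#!/bin/sh
# Hypothetical helper: decide what to do given the common ancestor (base),
# our HEAD commit (head) and the commit to merge with (other).
merge_action() {
    base=$1 head=$2 other=$3
    if [ "$head" = "$other" ] || [ "$base" = "$other" ]; then
        # The other side is already contained in ours: nothing to do.
        echo "up-to-date"
    elif [ "$base" = "$head" ]; then
        # Our side is a strict subset of theirs: just move HEAD forward.
        echo "update-to $other"
    else
        # Both sides have commits of their own: a real merge is needed.
        echo "merge"
    fi
}

merge_action aaa aaa bbb    # prints "update-to bbb"
merge_action aaa bbb aaa    # prints "up-to-date"
merge_action aaa bbb ccc    # prints "merge"
```

Only the last case needs the `read-tree -m`, `write-tree` and `commit-tree` sequence. In the other two no new commit is created, which keeps the history free of merges that combine nothing.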
From: randy_dunlap [email blocked]
Subject: yet another gitting started
Date: Sun, 17 Apr 2005 19:55:12 -0700

Here's the beginnings of yet another git usage/howto/tutorial.

It can grow or die...

I'll gladly take patches for it, or Pasky et al can merge more
git plumbing and toilet usages into it, with or without me.

http://www.xenotime.net/linux/doc/git-usage-howto

---
~Randy

So are they (kernel people) g

Anonymous (not verified)
on April 18, 2005 - 2:36pm

So are they (kernel people) going to use git in the long term?

I was under the impression that they were only going to use git
as a short-term solution.

I still don't really get why Linus wrote a basic SCM.
Why not put all that effort into something that is really good
in the long term (a BK-killer SCM, bazaar-ng or something like that)?

git is a long term solution

Anonymous (not verified)
on
April 18, 2005 - 3:45pm

Linus really looked hard into bazaar-ng, arch/tla, darcs and various others. They are simply too slow for the things he wants to do.

The kernel tree is > 100MB of code, and the other SCMs really don't scale well to those sizes.

git is being developed REALLY REALLY quickly, features are being added left and right, and the core is already very stable. Pretty soon it will be able to do all the things the devs need.

git is not an SCM, it is just a versioning filesystem (in userspace) which supports (will support) merging and branching.

A versioning filesystem is an

Anonymous (not verified)
on
April 18, 2005 - 11:44pm

A versioning filesystem is an SCM. (See the canonical analogy, ODS5 vs RCS ;)
And, seeing the 10x expansion in repository size, and in required bandwidth, is git not "too slow"?
Let's look at the *BSDs, they still use CVS. Their trees are well over 100M too.
And, while CVS operations seem really really slow indeed, can it be done better?

ODS5 is the file system used

Anonymous (not verified)
on
April 18, 2005 - 11:51pm

ODS5 is the file system used by VMS. It supports file versioning.
And I think this "git is not an SCM" hype really has one purpose, to avoid comparison with other SCMs.
Which, IMO, would be interesting and beneficial. I don't see why Linus wants to avoid it...
But if it's "not an SCM", OK, let's compare it with other versioning file systems. If it beats ODS5's performance, my respect. (Hint: I cannot believe it ever would.)
Should we develop some benchmark suite for SCMs? :)

CVS is no good ...

Anonymous (not verified)
on
April 19, 2005 - 3:55am

"Let's look at the *BSDs, they still use CVS"

But the BSD guys work in a completely different manner to Linus.

Most SCMs are designed to work on the "I'm the boss and I pay your wages, you'll do what I tell you" basis. For the BSDs that works because you have one guy controlling everything and a *s*m*a*l*l* team of developers. So you have ONE master setup, and everybody works in their own little patch of it - management ensuring that you don't get two people working on the same stuff at the same time.

That last sentence is the BIG PROBLEM behind most SCMs. They cannot cope with "several workers, one bit of code". Two workers trying to check out the same bit of code? Most SCMs have conniptions and then fall over in a heap. Then try and use that to manage Linux, where you get hundreds of developers all falling over the same *few* *lines* *of* *code* to fix a bug or whatever, and pretty much any semi-commercial SCM is dead in the water even before it's gone down the slipway!

The reason BK was so good for Linus is that (a) it allowed every developer to have their *own* *master* version, and (b) it allowed every developer to say "I want *these* changes, but *not* *those*, from *that* master tree". So, for example, Andrew Morton maintains the mm testing tree with maybe 30 different enhancements over Linus' main tree. And when Linus decided that feature X was mature enough, he did a "bk get" on *just* *that* *one* *feature*.

CVS, being file based, would have been a pain. And, to make it worse, if several other features had been added to the files Linus wanted, he would have got those features as well.

Basically, BK (and now git) are different from other SCMs because they have distributed masters, not just one HEAD, and they allow selective merging of feature sets between trees, not simply checking in and out individual files.

Cheers,
Wol

Most clarifying description

Federico (not verified)
on
April 19, 2005 - 5:01am

Thank you! This is the most interesting and clarifying thing I've read about git/cvs/linux kernel management in years!

"So, for example, Andrew Mort

Anonymous (not verified)
on
April 19, 2005 - 8:34pm

"So, for example, Andrew Morton maintains the mm testing tree with maybe 30 different enhancements over Linus' main tree. And when Linus decided that feature X was mature enough, he did a "bk get" on *just* *that* *one* *feature*."

Well, the example is OK. However, I'm not sure about now, but the -mm tree has been maintained by Andrew without bk. See http://www.zip.com.au/~akpm/linux/patches/patch-scripts-0.18/docco.txt

I think the difference is the

Anonymous (not verified)
on
April 19, 2005 - 4:02am

I think the difference is the way the respective kernels are developed.
BSD (I'm guessing here) has one (main) tree, and the developers (five, ten or more?) who are entrusted with CVS write access maintain that tree. Any one of them can make any change they like. They also spread the slowness of CVS among them.

Linux, OTOH, has one (main) tree with one trusted developer (Linus). The other developers have their own trees that they can mess up in any way they like. And when they need to put something in the main tree they ping Linus to do that for them. The result: Linus alone suffers any CVS slowness.

So to summarize: if ten BSD developers need to put one thing each in the tree, they each suffer CVS-slowness*1. If ten Linux developers need to put one thing each in the tree, Linus suffers CVS-slowness*10.

- Peder

The repository is large, but

Anonymous (not verified)
on
April 20, 2005 - 3:47am

The repository is large, but the chief operations the tree should support are fast: committing changes, merging patches. The increase in size is largely due to the wish to keep things simple. For instance, when a file is changed, a whole copy of that file is always stored, rather than some kind of incremental line-based or byte-based diff.

The other systems encode and store the differences, but then it gets harder to get, say, the latest version or the version 5 revisions ago, because you have to track the incremental changes and possibly do quite a bit of reading of the repository in order to compute the contents of the file at the version of interest.

There aren't many other reasons why the git repo is large. These things are tradeoffs. You can have speed and simplicity but then end up paying with size, sometimes. But that still doesn't mean it isn't fast for whatever you want to do with it. Linus had very clear design goals from the start and built git according to them.
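The whole-copy approach described above can be sketched as follows. This is a toy model under stated assumptions, not git's actual object format: the `SnapshotStore` class is hypothetical, and keying each object by the SHA-1 of its uncompressed contents and compressing with zlib are choices made for this illustration.

```python
import hashlib
import zlib


class SnapshotStore:
    """Toy store that keeps a full compressed copy of every file version."""

    def __init__(self):
        self.objects = {}  # hex digest -> compressed bytes

    def put(self, content: bytes) -> str:
        # Assumption for this sketch: the key is the SHA-1 of the raw content.
        key = hashlib.sha1(content).hexdigest()
        self.objects[key] = zlib.compress(content)
        return key

    def get(self, key: str) -> bytes:
        # Any version is one lookup and one decompress; no diffs to replay.
        return zlib.decompress(self.objects[key])

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.objects.values())


store = SnapshotStore()
old = b"int main(void)\n{\n\treturn 0;\n}\n"
new = b"int main(void)\n{\n\treturn 1;\n}\n"
old_key = store.put(old)
new_key = store.put(new)  # a one-byte change still stores a whole new object
print(len(store.objects), store.stored_bytes())
```

Reading any version costs the same no matter how old it is, which is the speed side of the tradeoff. The size side shows up in `stored_bytes()`, which grows by a full compressed file for every change.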

Could be improved

Anonymous (not verified)
on
April 22, 2005 - 12:16pm

I agree that this is probably very optimized for read/write but not for space. This space problem could be optimized with a simple on the fly patch engine and an intermediate patch distance parameter. This parameter would be set per repository or even down to a file granularity if desired.

This could work in this manner:

Intermediate patch distance (Z) = 5

Write Action                Repository storage    Read Action
import file a.c (0)      -> whole file         -> whole file
commit file a.c diff (1) -> diff to (0)        -> patch (0+1)
commit file a.c diff (2) -> diff to (1)        -> patch (0+1+2)
commit file a.c diff (3) -> diff to (2)        -> patch (0+1+2+3)
commit file a.c diff (4) -> diff to (3)        -> patch (0+1+2+3+4)
commit file a.c diff (5) -> diff to (4)        -> patch (0+1+2+3+4+5)
commit file a.c diff (6) -> whole file         -> whole file
commit file a.c diff (7) -> diff to (6)        -> patch (6+7)
commit file a.c diff (8) -> diff to (7)        -> patch (6+7+8)
...

This could reduce the size of the repository by a factor of Z (of course minus the average patch size) while keeping a reasonable read speed because git would only have to patch up to Z steps back when reading a version which should be fairly quick. The write speed should be about constant dependent on the average size of the patch (I assume that the delta is sent over the wire on a diff) and the file is generated on the repository side from the previous version + the diff.
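The scheme proposed above can be sketched like this. It is a minimal illustration of the commenter's idea, not anything git does: `KeyframeStore`, `make_delta` and `apply_delta` are hypothetical names, and the delta format (copy/insert operations derived from Python's `difflib.SequenceMatcher`) is an assumption chosen for brevity.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Encode new (a list of lines) as copy/insert operations against old."""
    delta = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))          # reuse old[i1:i2]
        elif tag in ("replace", "insert"):
            delta.append(("insert", new[j1:j2]))    # carry the new lines
        # "delete" needs no operation: the old lines are simply not copied
    return delta


def apply_delta(old, delta):
    """Rebuild the new version from old plus a delta from make_delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


class KeyframeStore:
    """Store a whole file, then up to `distance` diffs, then a whole file again."""

    def __init__(self, distance):
        self.distance = distance  # the "intermediate patch distance" Z
        self.entries = []         # ("full", lines) or ("delta", operations)

    def commit(self, lines):
        version = len(self.entries)
        if version % (self.distance + 1) == 0:
            self.entries.append(("full", list(lines)))
        else:
            previous = self.read(version - 1)
            self.entries.append(("delta", make_delta(previous, list(lines))))
        return version

    def read(self, version):
        # Walk back to the nearest whole file, then replay at most Z diffs.
        start = version - version % (self.distance + 1)
        lines = list(self.entries[start][1])
        for v in range(start + 1, version + 1):
            lines = apply_delta(lines, self.entries[v][1])
        return lines
```

With `KeyframeStore(5)`, versions 0 and 6 are stored whole and versions 1 to 5 and 7 onward as diffs, matching the table. A read never replays more than Z diffs, which is the bound on read cost the comment relies on.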

Just an idea.

Don't think so

Magnus Sundberg
on
April 23, 2005 - 3:43am

One of Linus's comments was that he wanted to trust the system, and he planned to accomplish that by keeping it as simple as possible.

Not now, of course

Anonymous (not verified)
on
April 24, 2005 - 3:40pm

But in the future. Once the unoptimized version works I'm sure people would want to fix this problem one way or the other.

All indications are that with

Kevin Smith (not verified)
on
April 18, 2005 - 8:47pm

All indications are that within a few weeks, git will become a better SCM for the kernel project than any other free SCM is today. Partly that's because it is focused very narrowly (for now) on features and constraints of the kernel. But partly it's because Linus has taken a fresh look at SCM, has come up with some innovative ideas, and is leveraging highly incremental development (and a handful of eager volunteers) to make rapid progress.

git is self-hosting (and has been from very early on), and is already being used for actual kernel work. That's pretty impressive.

Other SCMs begin using GIT

Georgi Chorbadzhiyski (not verified)
on
April 20, 2005 - 5:23am

The guys behind DARCS and Arch are starting to use git as their backend.

latest monotone release menti

Anonymous (not verified)
on
April 19, 2005 - 12:13am

The latest monotone release mentions Linus, amongst others. I don't know how that fits into the "big picture".

monotone news
