I'm starting to get gfi (git-fast-import) prepared for a merge into the main git.git tree. For those who don't know, gfi is the result of my work with Jon Smirl on trying to *quickly* import the massive Mozilla CVS repository into Git. Recently it's been getting a lot of attention from the KDE, OOo, Dragonfly BSD, and Qt projects. When exactly we merge it in will depend a lot on Junio. It should be safe to merge before 1.5.0 as it's strictly new source files, but we may still want to wait until after 1.5.0 is out. I'm mainly worried about breaking compilation on odd architectures. gfi builds, runs and has been used for production level imports on Mac OS X, Linux and Dragonfly BSD, using both 32 bit and 64 bit architectures, but some of Git's other targets (e.g. AIX) haven't seen any testing. The gfi code is quite stable and has been getting a lot of use (and discussion) lately. A new test (t/t9300-fast-import.sh) has been added and now, finally, documentation (Documentation/git-fast-import.txt). As gfi is 1962 lines of C and its development history consists of 74 commits made over the span of 7 months (first commit was Aug 5, 2006) and several versions of core Git code (which gfi calls into, and which has gone through some non-trivial changes during that time), I'm going to ask Junio to directly pull the topic branch into git.git, rather than submitting it as patches. My topic branch is published on repo.or.cz (thanks Pasky!). I would encourage all parties who would have otherwise been interested in reviewing the patches on the mailing list to clone/fetch the topic and review it locally instead. gitweb: http://repo.or.cz/w/git/fastimport.git clone: git://repo.or.cz/git/fastimport.git I'm particularly interested in feedback on the documentation, so I am attaching it below. ------- git-fast-import(1) ================== NAME ---- git-fast-import - Backend for fast Git data importers. SYNOPSIS -------- frontend | 'git-fast-import' [options] DESCRIP...
Compilation errors are the simplest to fix, just send it in. I have to import lots of data from perforce spaghetti, so I'm very likely to try it out. -
True. But it really is annoying when you download the latest-and-greatest release of a package only to find out it doesn't compile on your OS of choice, and even worse when you find out it is because of new code that you will never use which was added in just before I can't help you with spaghetti, but the Qt folks did make their Perforce importer available. Chris Lee put it in the fast-export project on repo.or.cz. It's a relatively short Python program. Might help you get started. They created annotated tags (with no message) for every p4 changeset. I think it's just because they didn't realize you can use (abuse?) the `reset` command in gfi to create lightweight tags instead. I actually implemented a "data <path" command in gfi to tell gfi to load data from a file, for this type of case where the foreign system has dropped the files in your working directory and you just want Git to read them. But there's no synchronization between gfi and the frontend (aside from the pipe buffer throttling the frontend), so there is no way for the frontend to know that gfi has finished a batch of files and it's safe to ask p4 for the next revision. So I threw it away. It was only a 10 line patch anyway. :) -- Shawn. -
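For readers wondering what the `reset` trick mentioned above looks like from a frontend, here is a minimal Python sketch. It assumes the frontend has already assigned a mark to the commit being tagged; the helper name `lightweight_tag` and the example tag name are hypothetical, and the emitted text follows the `reset <ref>` / `from <commit-ish>` form of the fast-import stream.

```python
def lightweight_tag(tag_name: str, mark: int) -> bytes:
    """Build stream text that points refs/tags/<tag_name> at a marked commit.

    Hypothetical helper: assumes the frontend previously emitted a commit
    carrying 'mark :<mark>'. Fields are separated by exactly one space and
    each command line ends with exactly one LF.
    """
    return f"reset refs/tags/{tag_name}\nfrom :{mark}\n\n".encode("utf-8")


if __name__ == "__main__":
    # A frontend would write these bytes to the importer's stdin.
    print(lightweight_tag("p4-change-938", 938).decode("utf-8"), end="")
```

Compared with an annotated tag that has an empty message, this only moves a ref, so no tag object is created.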
Yes, I saw their code. That's how I started thinking of using gfi I found it's useless to do anything with p4 changes. They lack the most important part of history: parent. The comments get useless too, because they refer to the most recent change, with no practical way to extract anything in between. Not much of a problem, nobody writes anything sensible in perforce comments anyway. -
Is <tz> /really/ expressed in minutes? 500 minutes is 8 hours 20 minutes. I know what you mean, of course; and so would anyone reading it - so I suggest just dropping the ", in minutes" - as it's not true. Andy -- Dr Andy Parkins, M Eng (hons), MIEE andyparkins@gmail.com -
Agreed. It _is_ "in minutes", but it's in an oddish human-readable base-60 format. It's certainly *not* decimal, it's more like "two decimal digits encode each base-60 digit in the obvious way". Linus -
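To make the "two decimal digits per base-60 digit" point concrete, here is a small Python sketch that decodes an offset such as `-0500` into signed minutes. The function name `offset_to_minutes` is made up for illustration and is not part of Git.

```python
def offset_to_minutes(offutc: str) -> int:
    """Decode a '+hhmm' / '-hhmm' UTC offset into signed minutes.

    The four digits are hours and minutes, not a decimal minute count,
    so '-0500' means 5 hours (300 minutes), not 500 minutes.
    """
    sign = -1 if offutc[0] == "-" else 1
    hours = int(offutc[1:3])
    minutes = int(offutc[3:5])
    return sign * (hours * 60 + minutes)


if __name__ == "__main__":
    for value in ("-0500", "+0000", "+0530"):
        print(value, offset_to_minutes(value))
```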
What about this language? The time of the change is specified by `<time>` as the number of seconds since the UNIX epoch (midnight, Jan 1, 1970, UTC) and is written in base-10 notation using US-ASCII digits. The committer's timezone is specified by `<tz>` as a positive or negative offset from UTC. For example EST (which is typically 5 hours behind GMT) would be expressed in `<tz>` by ``-0500'' while GMT is ``+0000''. -- Shawn. -
EST is always 5 hours behind GMT. During the summer, EST is still 5 hours behind GMT, but the clocks which use ET are set to EDT (-0400) instead. -Daniel *This .sig left intentionally blank* -
Shawn O. Pearce <spearce@spearce.org> wrote: That is /not/ a timezone! Maybe an offset from UTC. -- Dr. Horst H. von Brand User #22616 counter.li.org Departamento de Informatica Fono: +56 32 2654431 Universidad Tecnica Federico Santa Maria +56 32 2654239 Casilla 110-V, Valparaiso, Chile Fax: +56 32 2797513 -
Indeed. Thank you for the correction. I'll push out fixed docs shortly. -- Shawn. -
Btw, one thing that might be a good idea to document very clearly: - in the native git format, the offset from UTC has *nothing* to do with the actual time itself. The time in native git is always in UTC, and the offset from UTC does not change "time" - it's purely there to tell in which timezone the event happened. So 12345678 +0000 and 12345678 -0700 are *exactly*the*same*date*, except event one happened in UTC, and the other happened in UTC-7. - in rfc2822 format, the offset from UTC actually *changes* the date. The date "Oct 12, 2006 20:00:00" will be two _different_ times when you say it is in PST or in UTC. And yes, for all I know we might get this wrong inside git too. It's easy to get confused, because they really do mean different things. For an example of this, do make test-date in git (which parses the argument using the "exact date" and "approxidate" versions respectively, and the exact date parsing will give the internal git representation on the first line in the middle column), and then: ./test-date "1234567890 -0800" ./test-date "1234567890 +0000" and then try ./test-date "Fri Feb 13 15:31:30 2009 PST" ./test-date "Fri Feb 13 15:31:30 2009 UTC" and notice how the first two (numeric) dates that differ in UTC offset will still return the exact same seconds since the epoch: 1234567890 -0800 1234567890 +0000 but the second example (with a rfc2822-like date), will show how the seconds-since-epoch changes, and gives: 1234567890 -0800 1234539090 +0000 respectively for those two dates. Logical? It actually is, but you have to understand how git represents date to see the logic. To git, the "timezone" is really totally irrelevant. It doesn't really affect the "date" at all. At most, it affects how you _print_ the date, and you can tell what timezone the computer was set to when the commit was made. And yes, I would not be at all surprised if we had some bug here where we got it wrong...
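The distinction can be reproduced outside of git. The Python sketch below is only an illustration (it does not call test-date or any Git code; the helper names are invented): in the raw format the offset is ignored when computing the instant, while for a wall-clock date the zone changes the resulting seconds-since-epoch.

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))
UTC = timezone.utc


def raw_to_epoch(raw: str) -> int:
    """Raw format '<time> SP <offset>': the offset never alters the instant."""
    seconds, _offset = raw.split(" ")
    return int(seconds)


def wall_clock_to_epoch(zone: timezone) -> int:
    """Interpret 'Fri Feb 13 15:31:30 2009' as local time in the given zone."""
    return int(datetime(2009, 2, 13, 15, 31, 30, tzinfo=zone).timestamp())


if __name__ == "__main__":
    # Same seconds, different offsets: identical instants.
    print(raw_to_epoch("1234567890 -0800"), raw_to_epoch("1234567890 +0000"))
    # Same wall clock, different zones: instants 8 hours apart.
    print(wall_clock_to_epoch(PST), wall_clock_to_epoch(UTC))
```

The second pair differs by 28800 seconds (8 hours), matching the two values quoted in the message above.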
Hi, FWIW I just grepped git for tz, and looked at the results. The place I had to think a bit more about was in builtin-blame.c:format_time(). Probably a special date format is needed to stay compatible with cvsserver, otherwise show_date() or even show_rfc2822_date() could be used. The code actually adds the timezone in minutes to the timestamp, and then calls gmtime() to be able to format the date with strftime() (something similar, without strftime() is done in show_[rfc2822_]date()). The result is correct AFAICT, although it would be cleaner IMHO to add yet another function to date.c which formats the time according to cvsserver's wishes. Post 1.5.0. Ciao, Dscho -
Here is the current language relating to date parsing in gfi: Date Formats ~~~~~~~~~~~~ The following date formats are supported. A frontend should select the format it will use for this import by passing the format name in the `--date-format=<fmt>` command line option. `raw`:: This is the Git native format and is `<time> SP <offutc>`. It is also gfi's default format, if `--date-format` was not specified. + The time of the event is specified by `<time>` as the number of seconds since the UNIX epoch (midnight, Jan 1, 1970, UTC) and is written as an ASCII decimal integer. + The local offset is specified by `<offutc>` as a positive or negative offset from UTC. For example EST (which is 5 hours behind UTC) would be expressed in `<offutc>` by ``-0500'' while UTC is ``+0000''. The local offset does not affect `<time>`; it is used only as an advisement to help formatting routines display the timestamp. + If the local offset is not available in the source material, use ``+0000'', or the most common local offset. For example many organizations have a CVS repository which has only ever been accessed by users who are located in the same location and timezone. In this case the offset from UTC can be easily assumed. + Unlike the `rfc2822` format, this format is very strict. Any variation in formatting will cause gfi to reject the value. `rfc2822`:: This is the standard email format as described by RFC 2822. + An example value is ``Tue Feb 6 11:22:18 2007 -0500''. The Git parser is accurate, but a little on the lenient side. Its the same parser used by gitlink:git-am[1] when applying patches received from email. + Some malformed strings may be accepted as valid dates. In some of these cases Git will still be able to obtain the correct date from the malformed string. There are also some types of malformed strings which Git will parse wrong, and yet consider valid. Seriously malformed strings will be rejected. + Unlike the `raw`...
Say what? If I use the "raw" format with UTC offset, the offset is just "already uses Unix-epoch format, can be coaxed to give dates in that Better fix that. It can't be that costly to call gettimeofday(2) once and See? -- Dr. Horst H. von Brand User #22616 counter.li.org Departamento de Informatica Fono: +56 32 2654431 Universidad Tecnica Federico Santa Maria +56 32 2654239 Casilla 110-V, Valparaiso, Chile Fax: +56 32 2797513 -
The offset that git maintains is basically always ignored by git except for pure printout purposes. For example, when you traverse commits, git normally picks the next reachable commit to show by using the date. The UTC offset has no effect on anything. In fact, when we parse a commit, we don't even *parse* the timezone info. Look in commit.c: parse_commit_date. The timezone really doesn't even exist as far as any "real" git operation is concerned. It's just saved away, and it's _shown_ in "git log", but it has no real meaning apart from that. So git very much only works on UTC time internally, and the only thing that actually matters in a string like "1234567890 -0700" is the first part. The "-0700" is _literally_ just a comment that is only ever even parsed by "pretty_print_commit()". Btw, CVS doesn't have any TZ info at all, so CVS also internally always saves in UTC. It then tends to print out logs in whatever timezone you happen to be in at the time of printout, afaik. Linus -
Should be "It's" or "It is". -- Karl Hasselstr
I doubt it would confuse anybody. Although usually we'd not say "in base-10 notation using US-ASCII digits"; the normal way to do that is to just say "as an ASCII decimal integer". Sure, people could try to do "10,200,300" and claim it's "decimal integer", but at that point, you can just tell them they're crazy, and ignore them ;) But your text certainly isn't wrong. I just think it overspecifies a bit, at the expense of readability. Linus -
Heh, right you are! Nico's point about using parse_date() here is a really good one. I'm going to modify that section of gfi to use parse_date(), which would change the language here anyway. I'll try not to make a silly mistake such as the above in the updated docs. :) -- Shawn. -
Do we have example frontend that can be added along with gfi ? -aneesh -
Not yet. Some frontends are available here on repo.or.cz: gitweb: http://repo.or.cz/w/fast-export.git clone: git://repo.or.cz/fast-export.git But both lack branch support, for example, so they probably aren't nearly as complete as the existing non-gfi based importers. -- Shawn. -
It might be nice to have a git-fast-export, which could actually be potentially useful for generating a repository with systematic differences from the original. (E.g., to make a repository of git's Documentation directory, with just the commits that affect it) That might also be a big help to projects that find they should have been using more, fewer, or different repositories through their history. Also, I'd guess that it would be pretty straightforward and easy to understand, plus easy to verify correctness on large examples with. -Daniel *This .sig left intentionally blank* -
That kind of thing isn't hard to do. See the scripts which create the 'JFFS2 for eCos' git tree or the 'exported kernel headers' git tree, directly from Linus' git tree. -- dwmw2 -
Search the list-archives for "git-split", that may be what you're looking for. -- best regards Ray -
Or to solve problems like Gaaah! This file we've had in the repository for the last 17 months has copyright problems and we can't distribute it! or Wouldn't it be nice to permanently include all that old Linux history that's currently grafted onto the "real" history? In other words, general history rewriting, but fast. (Disclaimer: I've never tried to use the history rewrite tool that Cogito has, so I don't know its limitations, or how fast it is.) -- Karl Hasselstr
Yeah, I think fast-import is great. And I'd also like to echo that call to not call it "gfi". Maybe it's just me, and maybe it's just because I'm a home-owner who does things like add in-wall ethernet cables, but to me, gfi is about an electrical outlet. So to me, gfi means "ground fault interrupter": the kind of outlet that breaks the circuit if there is current leaking to the ground pin. All your electrical outlets in "wet areas" (bathroom, kitchen within a certain distance of a sink, outside, near swimming pools etc) are supposed to be GFI's. I realize that there's not a lot of chance of confusion in the git world, but still. Linus -
OK. There happen to be 78 uses of `gfi` in the manpage. I'll correct the spelling to fast-import. :-) -- Shawn. -
Didn't you listen to what Linus said? Near porcelain and plumbing is precisely where you _need_ gfi! -- Karl Hasselstr
Heh. I was actually sort of thinking of renaming it git-gfi. :) git-fast-import is just too long to write. And for some reason I have been writing it a lot lately. #git, email, git-fast-export's manual page (which is now also the largest manual page in all of Git!). But of course the better name is git-fast-import. Stealing a three-letter non-hyphen-containing name for a tool the user never is meant to run by hand is just evil. I haven't even tried to use fast-import for general history rewriting, let alone benchmarked it against something like git-split or Cogito's rewriting tool, but I'd be willing to bet that fast-import is faster. The internal ``cache'' that it uses for the tree construction is lightweight enough that gfi can probably recreate only the modified trees, compress and hash them, and output what it needs to, in the time it takes to fork+exec git-commit-tree. -- Shawn. -
I think this is quite error prone, demonstrated by the fact that we screwed that up ourselves on a few occasions. I think that the frontend should be relieved from this by letting it provide the time of change in a more natural format amongst all possible ones(like RFC2822 for example) and gfi should simply give it to parse_date(). Otherwise I think this is pretty nice. Nicolas -
This is a really good point. It's a little bit of work to switch to parse_date(); I'll try to get it done tomorrow night. -- Shawn. -
Actually, I disagree. We've traditionally had _fewer_ bugs with the pure integer format than we ever had with RFC2822 format. The original (first seven days) date format inside git objects was rfc2822, and it was *horrible*. Not only does it take time to parse, people get it constantly wrong, and it's ambiguous what summer-time means etc. It's basically impossible to get anything that is totally repeatable from it, and you have to be so lax as to effectively accept even buggy input. And yes, buggy input exists. So I would strongly suggest that gfi keeps to the standard git date format which is easy to parse, and totally unambiguous. Yes, you can get it wrong, but at least then it's very clear *who* gets it wrong: it's whatever feeds data to gfi. If gfi accepts a "soft" format, you get into all these gray areas of whether you want to be strictly rfc2822 only, or whether you actually want to accept stuff that everybody accepts (including the git date functions, that try very hard to turn anything sensible into a date). And DST. And odd timezone names, etc etc. Having a hard format, set in stone, and totally unambiguous, is really a good thing. It actually ends up resulting in fewer bugs in the end, because it just makes sure that everybody is on the same page. Linus -
Hmm. Actually I think it depends on the source data. :-) If the source is only supplying RFC2822 date format and is reliable in its formatting of such, having gfi parse that rather than the frontend is probably more reliable. (Git already has a well tested date parsing routine.) But if the source is easily able to get a time_t then that is just as easily formatted out to gfi, and reading that without error is child's play. After reading your email I'm now contemplating making this a command line flag, like `--date-format=rfc2822`, so a frontend could ask gfi to use parse_date() and whatever error that might bring, or Which is why gfi is very strict about its handling of whitespace. It assumes *exactly* one space between input fields, or *exactly* one LF between commands. Anything else is assumed to be part of the next field. If spaces show up in the imported data, it's the frontend that is sending stuff incorrectly. Right now however gfi is not validating the author or committer command arguments. At all. Which means that although the documentation says the format must be such-and-such, gfi doesn't care. Whatever comes in on the `author` or `committer` line is copied verbatim into the commit object. gfi probably should at least verify that the timestamp part of the line actually contains digits. :) -- Shawn. -
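Since the importer (as described above) copies the `committer` line verbatim, a frontend may want to check its own output before sending it. The following Python sketch is one possible self-check, not gfi's actual validation: the regular expression and the function name `is_strict_raw_committer` are assumptions, written to match the "exactly one space between fields" rule and the raw `<time> SP <offutc>` format.

```python
import re

# Assumed shape: 'committer' SP <name> SP '<' <email> '>' SP <time> SP <offutc>
_COMMITTER_RAW = re.compile(
    r"committer (?P<name>[^<>\n]*) <(?P<email>[^<>\n]*)> "
    r"(?P<time>[0-9]+) (?P<offutc>[+-][0-9]{4})"
)


def is_strict_raw_committer(line: str) -> bool:
    """Return True if the line (without its trailing LF) is strictly formatted.

    Rejects doubled spaces before the timestamp, non-digit timestamps,
    and offsets that are not a sign followed by four digits.
    """
    return _COMMITTER_RAW.fullmatch(line) is not None


if __name__ == "__main__":
    good = "committer A U Thor <at@example.com> 1234567890 -0800"
    bad = "committer A U Thor <at@example.com>  1234567890 -0800"  # two spaces
    print(is_strict_raw_committer(good), is_strict_raw_committer(bad))
```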
I'm not so worried about the git date parsing routines (which are fairly solid) as about the fact that absolutely *tons* of people get rfc2822 wrong. And we'd never even see it, because git's date-parsing routines are very forgiving, and allow pretty much anything (and no, I'm not talking about approxidate(), which really *does* allow anything, I'm talking about the "strict" date parser). They allow pretty much any half-way valid date, exactly because people don't do rfc2822 right anyway (and because they are also meant to work even if you write the date by hand, like "12:34 2005-06-07"). And *particularly* when it comes to timezones, it just guesses. The whole daylight savings time thing is just too hard. And if no timezone exists, it will just take the current one, so things may *seem* like they work, but then two different people importing the *same* archive in two different locations will actually get different results! THAT'S A BAD THING! It's much better to specify the date so exactly that you simply cannot get different results with the same input. Sure, you can still mess up the program that actually generates the data for gfi, and have bugs like that *there*, but at least they'd have to think a bit about it. And the TZ problem is actually less likely if you have a strict TZ format. For example, when importing from CVS, the natural thing to do is to just always set TZ to +0000. Which gets you something reliable, and it won't depend on who did the import. But hey, especially if it's a flag, and especially if it's *documented* that the date parsing will depend on the current timezone etc, then maybe it's all ok. It's certainly convenient to be able to give the date in any format. It's just very easy to get bugs when you allow any random crud.. Linus -
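The CVS case can be sketched in Python as follows. The input layout `YYYY/MM/DD HH:MM:SS` is an assumption about how a frontend might receive the UTC timestamp, and `cvs_utc_to_raw` is a hypothetical helper. The point is that `calendar.timegm` treats the fields as UTC, so the result does not depend on the timezone of the machine running the import.

```python
import calendar
import time


def cvs_utc_to_raw(stamp: str) -> str:
    """Convert a UTC 'YYYY/MM/DD HH:MM:SS' stamp to '<time> SP +0000'.

    calendar.timegm interprets the parsed fields as UTC, unlike
    time.mktime, which would silently apply the local timezone.
    """
    fields = time.strptime(stamp, "%Y/%m/%d %H:%M:%S")
    return f"{calendar.timegm(fields)} +0000"


if __name__ == "__main__":
    print(cvs_utc_to_raw("2007/02/06 16:22:18"))
```

Two people running this on machines in different locations get the same string, which is the repeatability property argued for above.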
Well, exactly because GIT already has fairly solid date parsing routines, and the fact that we needed solid date parsing routines in the first place, exactly because people don't do rfc2822 right anyway, should be a hell of a big clue why we should parse date information for the gfi frontend. Because the date is for sure most likely in a screwed up format already and it is counter productive to have to deal with that in a duplicated piece of code. And the bare reality is that people will just not care to parse it right themselves. Quoting from the gfi manual: |A typical frontend for gfi tends to weigh in at approximately 200 |lines of Perl/Python/Ruby code. Most developers have been able to |create working importers in just a couple of hours, even though it |is their first exposure to gfi, and sometimes even to Git. This is |an ideal situation, given that most conversion tools are throw-away |(use once, and never look back). This is therefore a damn good idea if gfi can make things right out of crap because frontends will not get much attention after the first "hey it works" level. And the GIT date format, albeit being perfectly unambiguous, is not in line with the statement above. With the GIT date format a conversion _will_ be necessary in the frontend, while if gfi shoves it to parse_date() instead then no conversion is even likely to be needed by the frontend. I'd much prefer if frontend writers didn't have to care (and most probably manage to botch it if they have to) about date conversion. We even botched it a few times ourselves despite the fact that we're damn good. And because our date parsing code is damn good (hey we're just damn good aren't we?) I would bet that there will be much less conversion errors if gfi used parse_date() on provided data than if the frontend tries to parse the date itself. This is what we feed email submission through everyday anyway, so we must trust it to do a good job for imports as well. Nicolas -
Nevertheless, they _should_. The principle is simple -- wherever there is ambiguity, you should seek to resolve that as _close_ to the point of origin as possible. Your 'best guess' gets worse and worse the further you go from the source of the data. If you're exporting from a legacy repository in one part of the world, then transferring the raw data to a machine elsewhere to be imported into git, you _really_ want to be making your guesses about timezones and character sets in the _export_ stage; not the subsequent import. So there's a lot to be said for nailing down gfi's intermediate format and removing _all_ the ambiguity from it -- using git format dates (which I did that way precisely for the lack of ambiguity), and using UTF-8 (or some other _specified_ but not assumed character set). -- dwmw2 -
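As an illustration of resolving character-set ambiguity at the export stage, a frontend could transcode commit messages to UTF-8 before writing them to the stream. This Python sketch assumes the exporter knows (or is told) the legacy charset; the function name `to_utf8` and the Latin-1 example are hypothetical.

```python
def to_utf8(raw_message: bytes, source_charset: str) -> bytes:
    """Transcode a legacy commit message to UTF-8 at export time.

    Decoding with an explicit charset fails loudly (UnicodeDecodeError)
    on bytes that do not fit, rather than leaving the importer to guess.
    """
    return raw_message.decode(source_charset).encode("utf-8")


if __name__ == "__main__":
    legacy = b"caf\xe9"  # Latin-1 encoded text from a hypothetical old repository
    print(to_utf8(legacy, "latin-1"))
```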
Done. I just pushed a change to gfi which adds `--date-format=<fmt>`. For <fmt> you have the choice of: raw: Standard Git format. This is the default, as it's what the existing frontends by Chris Lee, Simon Hausmann, Jon Smirl, and Simon 'corecode' Schubert expect. rfc2822: Run whatever crap you give us through parse_date(), and cross your fingers. If parse_date() returns < 0 we bomb out, but otherwise take it at its word. now: This is a toy, but useful if you really want now, dammit. We just call datestamp() and tack that in. Note that the frontend must also supply the literal string `now` in the committer line (e.g. "committer A U Thor <at@example.com> now") to prevent us from bombing out. The last one will probably get more useful when I fix gfi so it can safely commit against active refs without losing commits (make it do a strict fast-forward check before updating). In this case it may be useful for something like git-cvsserver, as it avoids the need for a temporary directory, index, etc. -- Shawn. -
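From the frontend's side, the three formats differ only in what goes into the date field of the `committer` line. The Python sketch below shows one way a frontend might produce that field for each `--date-format` value. The function names are hypothetical, and the rfc2822 branch uses Python's `email.utils.format_datetime` as a stand-in for whatever the source system already provides.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def when_field(date_format: str, moment: datetime = None) -> str:
    """Return the date field for a committer line in the chosen format."""
    if date_format == "now":
        return "now"  # literal string; the importer supplies the time itself
    if date_format == "rfc2822":
        return format_datetime(moment)
    if date_format == "raw":
        offset = int(moment.utcoffset().total_seconds()) // 60
        sign = "-" if offset < 0 else "+"
        hours, minutes = divmod(abs(offset), 60)
        return f"{int(moment.timestamp())} {sign}{hours:02d}{minutes:02d}"
    raise ValueError(f"unknown date format: {date_format}")


def committer_line(name: str, email: str, when: str) -> str:
    """Join the fields with exactly one space and end with exactly one LF."""
    return f"committer {name} <{email}> {when}\n"


if __name__ == "__main__":
    est = timezone(timedelta(hours=-5))
    moment = datetime(2007, 2, 6, 11, 22, 18, tzinfo=est)
    for fmt in ("raw", "rfc2822", "now"):
        print(committer_line("A U Thor", "at@example.com", when_field(fmt, moment)), end="")
```

The `moment` passed in must be timezone-aware; a naive datetime would have no offset to report.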
I think you should call it something else than rfc2822. Because parse_date() accepts much more than just rfc2822. What about "cooked"? Nicolas -
It does accept a lot more than that, but straying away from rfc2822 gets into the grey areas of parse_date(). E.g. it matches crap such as 'yyyy-mm-dd' or 'yyyy-dd-mm'. But that is completely ambiguous! I don't really want to advertise that it is accepting non-RFC 2822 input here. I was thinking of doing an `iso` (yyyy-mm-dd hh:mm:ss) format, which may just defer into parse_date(), but again encourage the frontend to *only* feed that ISO style format. -- Shawn. -
OK that makes sense. Nicolas -
Well, if it doesn't build then just don't make it a fatal build error. That won't be worse than not having it included at all. And if it compiles then consider it as a bonus! Nicolas -
