Re: git-fast-import

Previous thread: [PATCH] Read cvsimport options from repo-config by James Bowes on Monday, February 5, 2007 - 9:22 pm. (4 messages)

Next thread: [PATCH 1/2] bash: Support git-rebase -m continuation completion. by Shawn O. Pearce on Monday, February 5, 2007 - 10:37 pm. (1 message)
To: <git@...>
Date: Monday, February 5, 2007 - 10:31 pm

I'm starting to get gfi (git-fast-import) prepared for a merge into
the main git.git tree.  For those who don't know, gfi is the result
of my work with Jon Smirl on trying to *quickly* import the massive
Mozilla CVS repository into Git.  Recently it's been getting a lot
of attention from the KDE, OOo, Dragonfly BSD, and Qt projects.

When exactly we merge it in will depend a lot on Junio.  It should
be safe to merge before 1.5.0 as it's strictly new source files,
but we may still want to wait until after 1.5.0 is out.

I'm mainly worried about breaking compilation on odd architectures.
gfi builds, runs and has been used for production level imports
on Mac OS X, Linux and Dragonfly BSD, using both 32 bit and 64 bit
architectures, but some of Git's other targets (e.g. AIX) haven't
seen any testing.

The gfi code is quite stable and has been getting a lot of use
(and discussion) lately.  A new test (t/t9300-fast-import.sh)
has been added and now, finally, documentation
(Documentation/git-fast-import.txt).

As gfi is 1962 lines of C and its development history consists of
74 commits made over the span of 7 months (first commit was Aug 5,
2006) and several versions of core Git code (which gfi calls into,
and which has gone through some non-trivial changes during that
time), I'm going to ask Junio to directly pull the topic branch
into git.git, rather than submitting it as patches.

My topic branch is published on repo.or.cz (thanks Pasky!).  I would
encourage all parties who would have otherwise been interested in
reviewing the patches on the mailing list to clone/fetch the topic
and review it locally instead.

	gitweb: http://repo.or.cz/w/git/fastimport.git
	clone:  git://repo.or.cz/git/fastimport.git

I'm particularly interested in feedback on the documentation,
so I am attaching it below.

-------
git-fast-import(1)
==================

NAME
----
git-fast-import - Backend for fast Git data importers.


SYNOPSIS
--------
frontend | 'git-fast-import' [options]

DESCRIP...
To: Shawn O. Pearce <spearce@...>
Cc: <git@...>
Date: Tuesday, February 6, 2007 - 9:50 am

Compilation errors are the simplest to fix, just send it in.
I have to import lots of data from perforce spaghetti, so I'm very
likely to try it out.
-
To: Alex Riesen <raa.lkml@...>
Cc: <git@...>
Date: Tuesday, February 6, 2007 - 1:43 pm

True.

But it really is annoying when you download the latest-and-greatest
release of a package only to find out it doesn't compile on your
OS of choice, and even worse when you find out it is because of
new code that you will never use which was added in just before

I can't help you with spaghetti, but the Qt folks did make their
Perforce importer available.  Chris Lee put it in the fast-export
project on repo.or.cz.  It's a relatively short Python program.
Might help you get started.

They created annotated tags (with no message) for every p4 changeset.
I think it's just because they didn't realize you can use (abuse?) the
`reset` command in gfi to create lightweight tags instead.
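
A minimal sketch of how a frontend might emit such lightweight tags. It assumes the `reset` syntax is `reset <ref>`, an optional `from <committish>` line, then a blank line, and that each changeset's commit was assigned a mark earlier in the stream. The helper names, the `p4/<change>` tag naming scheme, and the mark numbers are hypothetical, not taken from the Qt importer.

```python
import io


def lightweight_tag(ref_name, mark):
    """Return a `reset` command that points refs/tags/<ref_name> at the
    commit previously assigned the given mark (syntax is an assumption)."""
    return "reset refs/tags/{}\nfrom :{}\n\n".format(ref_name, mark)


def tag_changesets(out, changesets):
    """Write one lightweight tag per (changeset number, mark) pair."""
    for change, mark in changesets:
        out.write(lightweight_tag("p4/{}".format(change), mark))


if __name__ == "__main__":
    stream = io.StringIO()
    tag_changesets(stream, [(1001, 1), (1002, 2)])
    print(stream.getvalue(), end="")
```

In a real frontend `out` would be the pipe feeding gfi rather than an in-memory buffer.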


I actually implemented a "data <path" command in gfi to tell gfi
to load data from a file, for this type of case where the foreign
system has dropped the files in your working directory and you just
want Git to read them.

But there's no synchronization between gfi and the frontend (aside
from the pipe buffer throttling the frontend), so there is no way
for the frontend to know that gfi has finished a batch of files
and it's safe to ask p4 for the next revision.

So I threw it away.  It was only a 10 line patch anyway.  :)

-- 
Shawn.
-
To: Shawn O. Pearce <spearce@...>
Cc: <git@...>
Date: Tuesday, February 6, 2007 - 2:02 pm

Yes, I saw their code. That's how I started thinking of using gfi

I found it's useless to do anything with p4 changes. They lack
the most important part of history: parent. The comments get
useless too, because they refer to the most recent change,
with no practical way to extract anything in between. Not much
of a problem, nobody writes anything sensible in perforce
comments anyway.
-
To: <git@...>
Cc: Shawn O. Pearce <spearce@...>
Date: Tuesday, February 6, 2007 - 5:28 am

Is <tz> /really/ expressed in minutes?  500 minutes is 8 hours 20 minutes.

I know what you mean, of course; and so would anyone reading it - so I suggest 
just dropping the ", in minutes" - as it's not true.


Andy
-- 
Dr Andy Parkins, M Eng (hons), MIEE
andyparkins@gmail.com
-
To: Andy Parkins <andyparkins@...>
Cc: <git@...>, Shawn O. Pearce <spearce@...>
Date: Tuesday, February 6, 2007 - 12:37 pm

Agreed. It _is_ "in minutes", but it's in an oddish human-readable base-60 
format. It's certainly *not* decimal, it's more like "two decimal digits 
encode each base-60 digit in the obvious way".
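
To make the "two decimal digits per base-60 digit" reading concrete, here is a small sketch that converts an offset string to signed minutes. The function name is hypothetical and this is not how git itself parses the field.

```python
def offset_to_minutes(tz):
    """Convert '+hhmm' or '-hhmm' to signed minutes relative to UTC."""
    digits = tz[1:]
    if len(tz) != 5 or tz[0] not in "+-" or any(c not in "0123456789" for c in digits):
        raise ValueError("expected +hhmm or -hhmm, got {!r}".format(tz))
    hours, minutes = int(digits[:2]), int(digits[2:])
    if minutes >= 60:
        raise ValueError("minutes field out of range in {!r}".format(tz))
    total = hours * 60 + minutes
    return -total if tz[0] == "-" else total


if __name__ == "__main__":
    print(offset_to_minutes("-0500"))  # 5 hours behind UTC, as signed minutes
    print(divmod(500, 60))             # what "500 minutes" would really mean
```

Read as plain decimal minutes, 500 would be 8 hours 20 minutes; read as hhmm, `-0500` is 300 minutes behind UTC.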

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Andy Parkins <andyparkins@...>, <git@...>
Date: Tuesday, February 6, 2007 - 12:44 pm

What about this language?

	The time of the change is specified by `<time>` as the number of
	seconds since the UNIX epoch (midnight, Jan 1, 1970, UTC) and is
	written in base-10 notation using US-ASCII digits.  The committer's
	timezone is specified by `<tz>` as a positive or negative offset
	from UTC.  For example EST (which is typically 5 hours behind GMT)
	would be expressed in `<tz>` by ``-0500'' while GMT is ``+0000''.

-- 
Shawn.
-
To: Shawn O. Pearce <spearce@...>
Cc: Linus Torvalds <torvalds@...>, Andy Parkins <andyparkins@...>, <git@...>
Date: Wednesday, February 7, 2007 - 12:45 am

EST is always 5 hours behind GMT. During the summer, EST is still 5 hours 
behind GMT, but the clocks which use ET are set to EDT (-0400) instead. 

	-Daniel
*This .sig left intentionally blank*
-
To: Shawn O. Pearce <spearce@...>
Cc: Linus Torvalds <torvalds@...>, Andy Parkins <andyparkins@...>, <git@...>
Date: Tuesday, February 6, 2007 - 9:17 pm

Shawn O. Pearce <spearce@spearce.org> wrote:


That is /not/ a timezone! Maybe an offset from UTC.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                    Fono: +56 32 2654431
Universidad Tecnica Federico Santa Maria             +56 32 2654239
Casilla 110-V, Valparaiso, Chile               Fax:  +56 32 2797513
-
To: Horst H. von Brand <vonbrand@...>
Cc: Linus Torvalds <torvalds@...>, Andy Parkins <andyparkins@...>, <git@...>
Date: Wednesday, February 7, 2007 - 1:46 am

Indeed.  Thank you for the correction.  I'll push out fixed docs
shortly.

-- 
Shawn.
-
To: Horst H. von Brand <vonbrand@...>
Cc: Shawn O. Pearce <spearce@...>, Andy Parkins <andyparkins@...>, <git@...>
Date: Tuesday, February 6, 2007 - 10:50 pm

Btw, one thing that might be a good idea to document very clearly:

 - in the native git format, the offset from UTC has *nothing* to do with 
   the actual time itself. The time in native git is always in UTC, and 
   the offset from UTC does not change "time" - it's purely there to tell 
   in which timezone the event happened.

   So 12345678 +0000 and 12345678 -0700 are *exactly*the*same*date*, 
   except one event happened in UTC, and the other happened in UTC-7.

 - in rfc2822 format, the offset from UTC actually *changes* the date. The 
   date "Oct 12, 2006 20:00:00" will be two _different_ times when you say 
   it is in PST or in UTC.

And yes, for all I know we might get this wrong inside git too. It's easy 
to get confused, because they really do mean different things.

For an example of this, do

	make test-date

in git (which parses the argument using the "exact date" and "approxidate" 
versions respectively, and the exact date parsing will give the internal 
git representation on the first line in the middle column), and then:

	./test-date "1234567890 -0800"
	./test-date "1234567890 +0000"

and then try

	./test-date "Fri Feb 13 15:31:30 2009 PST"
	./test-date "Fri Feb 13 15:31:30 2009 UTC"

and notice how the first two (numeric) dates that differ in UTC offset 
will still return the exact same seconds since the epoch:

	1234567890 -0800
	1234567890 +0000

but the second example (with a rfc2822-like date), will show how the 
seconds-since-epoch changes, and gives:

	1234567890 -0800
	1234539090 +0000

respectively for those two dates.
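
The same arithmetic can be reproduced outside git. This is only a sketch using Python's standard library rather than git's own date parser, with PST assumed to be a fixed UTC-8 offset and hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8))  # assumption: PST as a fixed -0800 offset


def raw_seconds(raw):
    """In the raw format only the first field carries the time."""
    return int(raw.split(" ")[0])


def wall_clock_seconds(tz):
    """Seconds since the epoch for 'Fri Feb 13 15:31:30 2009' read in tz."""
    return int(datetime(2009, 2, 13, 15, 31, 30, tzinfo=tz).timestamp())


if __name__ == "__main__":
    # Raw dates: the offset does not change the instant.
    print(raw_seconds("1234567890 -0800"), raw_seconds("1234567890 +0000"))
    # Wall-clock dates: the zone changes which instant is meant.
    print(wall_clock_seconds(PST), wall_clock_seconds(timezone.utc))
```

The two wall-clock results differ by 28800 seconds, which is exactly the eight-hour offset.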

Logical? It actually is, but you have to understand how git represents 
date to see the logic. To git, the "timezone" is really totally 
irrelevant. It doesn't really affect the "date" at all. At most, it 
affects how you _print_ the date, and you can tell what timezone the 
computer was set to when the commit was made.

And yes, I would not be at all surprised if we had some bug here where we 
got it wrong...
To: Linus Torvalds <torvalds@...>
Cc: Horst H. von Brand <vonbrand@...>, Shawn O. Pearce <spearce@...>, Andy Parkins <andyparkins@...>, <git@...>
Date: Thursday, February 8, 2007 - 5:34 pm

Hi,


FWIW I just grepped git for tz, and looked at the results. The place I had 
to think a bit more about was in builtin-blame.c:format_time(). Probably a 
special date format is needed to stay compatible with cvsserver, otherwise 
show_date() or even show_rfc2822_date() could be used.

The code actually adds the timezone in minutes to the timestamp, and then 
calls gmtime() to be able to format the date with strftime() (something 
similar, without strftime() is done in show_[rfc2822_]date()). The result 
is correct AFAICT, although it would be cleaner IMHO to add yet another 
function to date.c which formats the time according to cvsserver's wishes.

Post 1.5.0.

Ciao,
Dscho

-
To: Linus Torvalds <torvalds@...>
Cc: Horst H. von Brand <vonbrand@...>, Andy Parkins <andyparkins@...>, <git@...>
Date: Wednesday, February 7, 2007 - 1:53 am

Here is the current language relating to date parsing in gfi:

Date Formats
~~~~~~~~~~~~
The following date formats are supported.  A frontend should select
the format it will use for this import by passing the format name
in the `--date-format=<fmt>` command line option.

`raw`::
	This is the Git native format and is `<time> SP <offutc>`.
	It is also gfi's default format, if `--date-format` was
	not specified.
+
The time of the event is specified by `<time>` as the number of
seconds since the UNIX epoch (midnight, Jan 1, 1970, UTC) and is
written as an ASCII decimal integer.
+
The local offset is specified by `<offutc>` as a positive or negative
offset from UTC.  For example EST (which is 5 hours behind UTC)
would be expressed in `<offutc>` by ``-0500'' while UTC is ``+0000''.
The local offset does not affect `<time>`; it is used only as an
advisement to help formatting routines display the timestamp.
+
If the local offset is not available in the source material, use
``+0000'', or the most common local offset.  For example many
organizations have a CVS repository which has only ever been accessed
by users who are located in the same location and timezone.  In this
case the offset from UTC can be easily assumed.
+
Unlike the `rfc2822` format, this format is very strict.  Any
variation in formatting will cause gfi to reject the value.

`rfc2822`::
	This is the standard email format as described by RFC 2822.
+
An example value is ``Tue Feb 6 11:22:18 2007 -0500''.  The Git
parser is accurate, but a little on the lenient side.  Its the
same parser used by gitlink:git-am[1] when applying patches
received from email.
+
Some malformed strings may be accepted as valid dates.  In some of
these cases Git will still be able to obtain the correct date from
the malformed string.  There are also some types of malformed
strings which Git will parse wrong, and yet consider valid.
Seriously malformed strings will be rejected.
+
Unlike the `raw`...
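
For a frontend that already has a time_t-style value, producing the `raw` format described above could look like the following sketch. The function name and the choice to pass the offset as signed minutes are assumptions made for illustration.

```python
def format_raw(epoch_seconds, offset_minutes=0):
    """Format '<time> SP <offutc>': decimal seconds since the epoch,
    one space, then a signed hhmm offset from UTC."""
    sign = "-" if offset_minutes < 0 else "+"
    hours, minutes = divmod(abs(offset_minutes), 60)
    return "{} {}{:02d}{:02d}".format(epoch_seconds, sign, hours, minutes)


if __name__ == "__main__":
    print(format_raw(1234567890, -8 * 60))  # source recorded a -0800 offset
    print(format_raw(1234567890))           # no offset known: fall back to +0000
```

Defaulting to `+0000` follows the advice in the draft for sources, such as CVS, that carry no offset information.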
To: Shawn O. Pearce <spearce@...>
Cc: Linus Torvalds <torvalds@...>, Horst H. von Brand <vonbrand@...>, Andy Parkins <andyparkins@...>, <git@...>
Date: Wednesday, February 7, 2007 - 6:18 pm

Say what? If I use the "raw" format with UTC offset, the offset is just


"already uses Unix-epoch format, can be coaxed to give dates in that


Better fix that. It can't be that costly to call gettimeofday(2) once and

See?
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                    Fono: +56 32 2654431
Universidad Tecnica Federico Santa Maria             +56 32 2654239
Casilla 110-V, Valparaiso, Chile               Fax:  +56 32 2797513

-
To: Horst H. von Brand <vonbrand@...>
Cc: Shawn O. Pearce <spearce@...>, Andy Parkins <andyparkins@...>, <git@...>
Date: Wednesday, February 7, 2007 - 6:39 pm

The offset that git maintains is basically always ignored by git except 
for pure printout purposes.

For example, when you traverse commits, git normally picks the next 
reachable commit to show by using the date. The UTC offset has no effect 
on anything.

In fact, when we parse a commit, we don't even *parse* the timezone info. 
Look in commit.c: parse_commit_date. The timezone really doesn't even 
exist as far as any "real" git operation is concerned. It's just saved 
away, and it's _shown_ in "git log", but it has no real meaning apart from 
that.

So git very much only works on UTC time internally, and the only thing 
that actually matters in a string like "1234567890 -0700" is the first 
part. The "-0700" is _literally_ just a comment that is only ever even 
parsed by "pretty_print_commit()".

Btw, CVS doesn't have any TZ info at all, so CVS also internally always 
saves in UTC. It then tends to print out logs in whatever timezone you 
happen to be in at the time of printout, afaik. 

		Linus
-
To: Shawn O. Pearce <spearce@...>
Cc: Linus Torvalds <torvalds@...>, Horst H. von Brand <vonbrand@...>, Andy Parkins <andyparkins@...>, <git@...>
Date: Wednesday, February 7, 2007 - 5:21 am

Should be "It's" or "It is".

-- 
Karl Hasselström
To: Shawn O. Pearce <spearce@...>
Cc: Andy Parkins <andyparkins@...>, <git@...>
Date: Tuesday, February 6, 2007 - 1:24 pm

I doubt it would confuse anybody. Although usually we'd not say

	"in base-10 notation using US-ASCII digits"

the normal way to do that is to just saying "as an ASCII decimal integer".

Sure, people could try to do "10,200,300" and claim it's "decimal 
integer", but at that point, you can just tell them they're crazy, and 
ignore them ;)

But your text certainly isn't wrong. I just think it overspecifies a bit, 
at the expense of readability.

		Linus
-
To: Andy Parkins <andyparkins@...>
Cc: <git@...>
Date: Tuesday, February 6, 2007 - 5:40 am

Heh, right you are!

Nico's point about using parse_date() here is a really good one.
I'm going to modify that section of gfi to use parse_date(), which
would change the language here anyway.  I'll try not to make a
silly mistake such as the above in the updated docs.  :)

-- 
Shawn.
-
To: <git@...>
Date: Tuesday, February 6, 2007 - 2:12 am

Do we have an example frontend that can be added along with gfi?

-aneesh
-
To: Aneesh Kumar K.V <aneesh.kumar@...>
Cc: <git@...>
Date: Tuesday, February 6, 2007 - 2:18 am

Not yet.  Some frontends are available here on repo.or.cz:

  gitweb: http://repo.or.cz/w/fast-export.git
  clone:  git://repo.or.cz/fast-export.git

But both lack branch support, for example, so they probably aren't
nearly as complete as the existing non-gfi based importers.

-- 
Shawn.
-
To: Shawn O. Pearce <spearce@...>
Cc: Aneesh Kumar K.V <aneesh.kumar@...>, <git@...>
Date: Wednesday, February 7, 2007 - 12:55 am

It might be nice to have a git-fast-export, which could actually be 
potentially useful for generating a repository with systematic differences 
from the original. (E.g., to make a repository of git's Documentation 
directory, with just the commits that affect it)

That might also be a big help to projects that find they should have been 
using more, fewer, or different repositories through their history.

Also, I'd guess that it would be pretty straightforward and easy to 
understand, plus easy to verify correctness on large examples with.

	-Daniel
*This .sig left intentionally blank*
-
To: Daniel Barkalow <barkalow@...>
Cc: Shawn O. Pearce <spearce@...>, Aneesh Kumar K.V <aneesh.kumar@...>, <git@...>
Date: Wednesday, February 7, 2007 - 9:38 am

That kind of thing isn't hard to do. See the scripts which create the
'JFFS2 for eCos' git tree or the 'exported kernel headers' git tree,
directly from Linus' git tree.

-- 
dwmw2

-
To: 'Daniel Barkalow' <barkalow@...>, 'Shawn O. Pearce' <spearce@...>
Cc: 'Aneesh Kumar K.V' <aneesh.kumar@...>, <git@...>
Date: Wednesday, February 7, 2007 - 5:29 am

Search the list-archives for "git-split", that may be what you're looking
for.

-- 
best regards

  Ray

-
To: Daniel Barkalow <barkalow@...>
Cc: Shawn O. Pearce <spearce@...>, Aneesh Kumar K.V <aneesh.kumar@...>, <git@...>
Date: Wednesday, February 7, 2007 - 5:13 am

Or to solve problems like

  Gaaah! This file we've had in the repository for the last 17 months
  has copyright problems and we can't distribute it!

or

  Wouldn't it be nice to permanently include all that old Linux
  history that's currently grafted onto the "real" history?

In other words, general history rewriting, but fast.

(Disclaimer: I've never tried to use the history rewrite tool that
Cogito has, so I don't know its limitations, or how fast it is.)

-- 
Karl Hasselström
To: Karl <kha@...>
Cc: Daniel Barkalow <barkalow@...>, Shawn O. Pearce <spearce@...>, Aneesh Kumar K.V <aneesh.kumar@...>, <git@...>
Date: Wednesday, February 7, 2007 - 7:17 am

Hi,

On Wed, 7 Feb 2007, Karl Hasselström
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Karl <kha@...>, Daniel Barkalow <barkalow@...>, Aneesh Kumar K.V <aneesh.kumar@...>, <git@...>
Date: Wednesday, February 7, 2007 - 6:55 pm

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Wed, 7 Feb 2007, Karl Hasselström
To: Shawn O. Pearce <spearce@...>
Cc: Karl <kha@...>, Daniel Barkalow <barkalow@...>, Aneesh Kumar K.V <aneesh.kumar@...>, <git@...>
Date: Wednesday, February 7, 2007 - 7:55 pm

Hi,

> > On Wed, 7 Feb 2007, Karl Hasselström
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Shawn O. Pearce <spearce@...>, <kha@...>, Daniel Barkalow <barkalow@...>, Aneesh Kumar K.V <aneesh.kumar@...>, <git@...>
Date: Thursday, February 8, 2007 - 12:56 pm

Yeah, I think fast-import is great. And I'd also like to echo that call to 
not call it "gfi". Maybe it's just me, and maybe it's just because I'm a 
home-owner who does things like add in-wall ethernet cables, but to me, 
gfi is about an electrical outlet.

So to me, gfi means "ground fault interrupter": the kind of outlet that 
breaks the circuit if there is current leaking to the ground pin. All your 
electrical outlets in "wet areas" (bathroom, kitchen within a certain 
distance of a sink, outside, near swimming pools etc) are supposed to be 
GFI's.

I realize that there's not a lot of chance of confusion in the git world, 
but still.

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Karl <kha@...>, Daniel Barkalow <barkalow@...>, Aneesh Kumar K.V <aneesh.kumar@...>, <git@...>
Date: Thursday, February 8, 2007 - 3:10 pm

OK.  There happen to be 78 uses of `gfi` in the manpage.
I'll correct the spelling to fast-import.  :-)

-- 
Shawn.
-
To: Shawn O. Pearce <spearce@...>
Cc: Linus Torvalds <torvalds@...>, Johannes Schindelin <Johannes.Schindelin@...>, Daniel Barkalow <barkalow@...>, Aneesh Kumar K.V <aneesh.kumar@...>, <git@...>
Date: Friday, February 9, 2007 - 4:49 am

Didn't you listen to what Linus said? Near porcelain and plumbing is
precisely where you _need_ gfi!

-- 
Karl Hasselström
To: <kha@...>
Cc: Shawn O. Pearce <spearce@...>, Johannes Schindelin <Johannes.Schindelin@...>, Daniel Barkalow <barkalow@...>, Aneesh Kumar K.V <aneesh.kumar@...>, <git@...>
Date: Friday, February 9, 2007 - 11:47 am

On Fri, 9 Feb 2007, Karl Hasselström
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Karl <kha@...>, Daniel Barkalow <barkalow@...>, Aneesh Kumar K.V <aneesh.kumar@...>, <git@...>
Date: Wednesday, February 7, 2007 - 8:12 pm

Heh.  I was actually sort of thinking of renaming it git-gfi.  :)

git-fast-import is just too long to write.  And for some reason I
have been writing it a lot lately.  #git, email, git-fast-export's
manual page (which is now also the largest manual page in all
of Git!).

But of course the better name is git-fast-import.  Stealing a
three-letter non-hyphen-containing name for a tool the user never
is meant to run by hand is just evil.


I haven't even tried to use fast-import for general history
rewriting, let alone benchmarked it against something like git-split
or Cogito's rewriting tool, but I'd be willing to bet that fast-import
is faster.  The internal ``cache'' that it uses for the tree
construction is lightweight enough that gfi can probably recreate
only the modified trees, compress and hash them, and output what
it needs to, in the time it takes to fork+exec git-commit-tree.

-- 
Shawn.
-
To: Shawn O. Pearce <spearce@...>
Cc: <git@...>
Date: Tuesday, February 6, 2007 - 12:06 am

I think this is quite error prone, demonstrated by the fact that we 
screwed that up ourselves on a few occasions.  I think that the frontend 
should be relieved from this by letting it provide the time of change in 
a more natural format amongst all possible ones (like RFC2822 for 
example) and gfi should simply give it to parse_date().

Otherwise I think this is pretty nice.


Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: <git@...>
Date: Tuesday, February 6, 2007 - 1:48 am

This is a really good point.  It's a little bit of work to switch
to parse_date(); I'll try to get it done tomorrow night.

-- 
Shawn.
-
To: Shawn O. Pearce <spearce@...>
Cc: Nicolas Pitre <nico@...>, <git@...>
Date: Tuesday, February 6, 2007 - 12:35 pm

Actually, I disagree. We've traditionally had _fewer_ bugs with the 
pure integer format than we ever had with RFC2822 format.

The original (first seven days) date format inside git objects was 
rfc2822, and it was *horrible*. Not only does it take time to parse, 
people get it constantly wrong, and it's ambiguous what summer-time means 
etc. It's basically impossible to get anything that is totally repeatable 
from it, and you have to be so lax as to effectively accept even buggy 
input. And yes, buggy input exists.

So I would strongly suggest that gfi keeps to the standard git date format 
which is easy to parse, and totally unambiguous. Yes, you can get it 
wrong, but at least then it's very clear *who* gets it wrong: it's 
whatever feeds data to gfi. If gfi accepts a "soft" format, you get into 
all these gray areas of whether you want to be strictly rfc2822 only, or 
whether you actually want to accept stuff that everybody accepts 
(including the git date functions, that try very hard to turn anything 
sensible into a date). And DST. And odd timezone names, etc etc.

Having a hard format, set in stone, and totally unambiguous, is really a 
good thing. It actually ends up resulting in fewer bugs in the end, 
because it just makes sure that everybody is on the same page.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Nicolas Pitre <nico@...>, <git@...>
Date: Tuesday, February 6, 2007 - 12:56 pm

Hmm.  Actually I think it depends on the source data.  :-)

If the source is only supplying RFC2822 date format and is reliable
in its formatting of such, having gfi parse that rather than
the frontend is probably more reliable.  (Git already has a well
tested date parsing routine.)  But if the source is easily able
to get a time_t then that is just as easily formatted out to gfi,
and reading that without error is child's play.

After reading your email I'm now contemplating making this a command
line flag, like `--date-format=rfc2822`, so a frontend could ask
gfi to use parse_date() and whatever error that might bring, or

Which is why gfi is very strict about its handling of whitespace.
It assumes *exactly* one space between input fields, or *exactly*
one LF between commands.  Anything else is assumed to be part of
the next field.  If spaces show up in the imported data, it's the
frontend that is sending stuff incorrectly.

Right now however gfi is not validating the author or committer
command arguments.  At all.  Which means that although the
documentation says the format must be such-and-such, gfi doesn't
care.  Whatever comes in on the `author` or `committer` line is
copied verbatim into the commit object.  gfi probably should at
least verify that the timestamp part of the line actually contains
digits.  :)
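
As an illustration of the kind of check meant here, the sketch below validates an `author` or `committer` line in the raw date format, insisting on exactly one space between fields. It is written in Python for brevity and is not gfi's C code; the pattern and function name are assumptions, and a non-empty name is required for simplicity.

```python
import re

# One space between fields, ASCII digits only, signed hhmm offset.
_IDENT = re.compile(
    r"(author|committer) ([^<>\n]+) <([^<>\n]*)> ([0-9]+) ([+-][0-9]{4})"
)


def check_ident_line(line):
    """Return (seconds, offset) for a well-formed line, else raise ValueError."""
    match = _IDENT.fullmatch(line)
    if match is None:
        raise ValueError("malformed author/committer line: {!r}".format(line))
    return int(match.group(4)), match.group(5)


if __name__ == "__main__":
    print(check_ident_line("committer A U Thor <at@example.com> 1170778938 -0500"))
```

A line with a doubled space before the timestamp, or with a non-numeric timestamp, is rejected instead of being copied verbatim into the commit object.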

-- 
Shawn.
-
To: Shawn O. Pearce <spearce@...>
Cc: Nicolas Pitre <nico@...>, <git@...>
Date: Tuesday, February 6, 2007 - 1:20 pm

I'm not so worried about the git date parsing routines (which are fairly 
solid) as about the fact that absolutely *tons* of people get rfc2822 
wrong.

And we'd never even see it, because git's date-parsing routines are very 
forgiving, and allow pretty much anything (and no, I'm not talking about 
approxidate(), which really *does* allow anything, I'm talking about the 
"strict" date parser). 

They allow pretty much any half-way valid date, exactly because people 
don't do rfc2822 right anyway (and because they are also meant to work 
even if you write the date by hand, like "12:34 2005-06-07").

And *particularly* when it comes to timezones, it just guesses. The whole 
daylight savings time thing is just too hard. And if no timezone exists, 
it will just take the current one, so things may *seem* like they work, 
but then two different people importing the *same* archive in two 
different locations will actually get different results!

THAT'S A BAD THING!

It's much better to specify the date so exactly that you simply cannot get 
different results with the same input.

Sure, you can still mess up the program that actually generates the data 
for gfi, and have bugs like that *there*, but at least they'd have to 
think a bit about it.

And the TZ problem is actually less likely if you have a strict TZ format. 
For example, when importing from CVS, the natural thing to do is to just 
always set TZ to +0000. Which gets you something reliable, and it won't 
depend on who did the import.

But hey, especially if it's a flag, and especially if it's *documented* 
that the date parsing will depend on the current timezone etc, then maybe 
it's all ok. It's certainly convenient to be able to give the date in any 
format. It's just very easy to get bugs when you allow any random crud..

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Shawn O. Pearce <spearce@...>, <git@...>
Date: Tuesday, February 6, 2007 - 2:53 pm

Well, exactly because GIT already has fairly solid date parsing 
routines, and the fact that we needed solid date parsing routines in the 
first place, exactly because people don't do rfc2822 right anyway, 
should be a hell of a big clue why we should parse date information for 
the gfi frontend.  Because the date is for sure most likely in a screwed 
up format already and it is counter productive to have to deal with that 
in a duplicated piece of code.  And the bare reality is that people will 
just not care to parse it right themselves.

Quoting from the gfi manual:

|A typical frontend for gfi tends to weigh in at approximately 200
|lines of Perl/Python/Ruby code.  Most developers have been able to
|create working importers in just a couple of hours, even though it
|is their first exposure to gfi, and sometimes even to Git.  This is
|an ideal situation, given that most conversion tools are throw-away
|(use once, and never look back).

This is therefore a damn good idea if gfi can make things right out of 
crap because frontends will not get much attention after the first "hey 
it works" level.  And the GIT date format, albeit being perfectly 
unambiguous, is not in line with the statement above.

With the GIT date format a conversion _will_ be necessary in the 
frontend, while if gfi shoves it to parse_date() instead then no 
conversion is even likely to be needed by the frontend.  I'd much prefer 
if frontend writers didn't have to care (and most probably manage to 
botch it if they have to) about date conversion.  We even botched it a 
few times ourselves despite the fact that we're damn good.

And because our date parsing code is damn good (hey we're just damn good 
aren't we?) I would bet that there will be far fewer conversion errors 
if gfi used parse_date() on provided data than if the frontend tries to 
parse the date itself.  This is what we feed email submission through 
everyday anyway, so we must trust it to do a good job for imports as 
well.


Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: Linus Torvalds <torvalds@...>, Shawn O. Pearce <spearce@...>, <git@...>
Date: Wednesday, February 7, 2007 - 6:58 am

Nevertheless, they _should_. The principle is simple -- wherever there
is ambiguity, you should seek to resolve that as _close_ to the point of
origin as possible. Your 'best guess' gets worse and worse the further
you go from the source of the data.

If you're exporting from a legacy repository in one part of the world,
then transferring the raw data to a machine elsewhere to be imported
into git, you _really_ want to be making your guesses about timezones
and character sets in the _export_ stage; not the subsequent import.

So there's a lot to be said for nailing down gfi's intermediate format
and removing _all_ the ambiguity from it -- using git format dates
(which I did that way precisely for the lack of ambiguity), and using
UTF-8 (or some other _specified_ but not assumed character set).

-- 
dwmw2

-
To: Nicolas Pitre <nico@...>
Cc: Linus Torvalds <torvalds@...>, <git@...>
Date: Tuesday, February 6, 2007 - 4:09 pm

Done.  I just pushed a change to gfi which adds `--date-format=<fmt>`.
For <fmt> you have the choice of:

  raw: Standard Git format.  This is the default, as it's what
  the existing frontends by Chris Lee, Simon Hausmann, Jon Smirl,
  and Simon 'corecode' Schubert expect.

  rfc2822: Run whatever crap you give us through parse_date(),
  and cross your fingers.  If parse_date() returns < 0 we bomb
  out, but otherwise take it at its word.

  now: This is a toy, but useful if you really want now, dammit.
  We just call datestamp() and tack that in.  Note that the frontend
  must also supply the literal string `now` in the committer line
  (e.g. "committer A U Thor <at@example.com> now") to prevent us
  from bombing out.
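
A sketch of what a frontend might write on the `committer` line for each of the three formats. The helper name and its arguments are hypothetical, and the RFC 2822 string comes from Python's `email.utils.format_datetime`, not from anything in gfi.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def committer_line(name, email, when=None, date_format="raw"):
    """Build a committer line matching the chosen --date-format value."""
    if date_format == "raw":
        offset = int(when.utcoffset().total_seconds()) // 60
        sign = "-" if offset < 0 else "+"
        hours, minutes = divmod(abs(offset), 60)
        stamp = "{} {}{:02d}{:02d}".format(
            int(when.timestamp()), sign, hours, minutes)
    elif date_format == "rfc2822":
        stamp = format_datetime(when)
    elif date_format == "now":
        stamp = "now"  # the literal string; gfi supplies the time itself
    else:
        raise ValueError("unknown date format: {!r}".format(date_format))
    return "committer {} <{}> {}".format(name, email, stamp)


if __name__ == "__main__":
    est = timezone(timedelta(hours=-5))
    when = datetime(2007, 2, 6, 11, 22, 18, tzinfo=est)
    for fmt in ("raw", "rfc2822", "now"):
        print(committer_line("A U Thor", "at@example.com", when, fmt))
```

`when` must be a timezone-aware datetime for the `raw` and `rfc2822` cases; for `now` it is ignored.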

The last one will probably get more useful when I fix gfi so it can
safely commit against active refs without losing commits (make it
do a strict fast-forward check before updating).  In this case it
may be useful for something like git-cvsserver, as it avoids the
need for a temporary directory, index, etc.
 
-- 
Shawn.
-
To: Shawn O. Pearce <spearce@...>
Cc: Linus Torvalds <torvalds@...>, <git@...>
Date: Tuesday, February 6, 2007 - 5:03 pm

I think you should call it something else than rfc2822.  Because 
parse_date() accepts much more than just rfc2822.  What about "cooked"?


Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: Linus Torvalds <torvalds@...>, <git@...>
Date: Tuesday, February 6, 2007 - 5:15 pm

It does accept a lot more than that, but straying away from rfc2822
gets into the grey areas of parse_date().  E.g. it matches crap such
as 'yyyy-mm-dd' or 'yyyy-dd-mm'.  But that is completely ambiguous!

I don't really want to advertise that it is accepting non-RFC 2822
input here.  I was thinking of doing an `iso` (yyyy-mm-dd hh:mm:ss)
format, which may just defer into parse_date(), but again encourage
the frontend to *only* feed that ISO style format.

-- 
Shawn.
-
To: Shawn O. Pearce <spearce@...>
Cc: Linus Torvalds <torvalds@...>, <git@...>
Date: Tuesday, February 6, 2007 - 5:42 pm

OK that makes sense.


Nicolas
-
To: Shawn O. Pearce <spearce@...>
Cc: <git@...>
Date: Monday, February 5, 2007 - 11:18 pm

Well, if it doesn't build then just don't make it a fatal build error.  
That won't be worse than not having it included at all.
And if it compiles then consider it as a bonus!


Nicolas
-