Re: mingw, windows, crlf/lf, and git

Previous thread: Efficiency of initial clone from server by Jon Smirl on Sunday, February 11, 2007 - 12:53 pm. (31 messages)

Next thread: [RFC] Speeding up a null fetch by Julian Phillips on Sunday, February 11, 2007 - 4:32 pm. (5 messages)
From: Mark Levedahl
Date: Sunday, February 11, 2007 - 4:13 pm

I am NOT intending to start a flamewar O:-) , so please don't turn this 
into one.

The recent threads on a mingw git port are explicit in the intent to 
provide a Windows native git. I believe there is a fundamental conflict 
here with the position, clearly stated by Linus, that git does not alter 
content in any way. Windows suffers the curse of DOS line endings (\r\n 
vs \n), and a true port to Windows *must* allow for \r\n and \n to be 
semantically the same thing as most large projects end up with a mixture 
of such files and/or are targeting cross-platform capabilities. The 
major competing solutions git seeks to supplant (cvs, cvsnt, svn, hg) 
have capability to recognize "text" files and transparently replace \r\n 
with \n on input, the reverse on output, and ignore all such differences 
on diff operations. To be relevant on native Windows, git must do the 
same. Otherwise, git will be deemed "too wierd" and dismissed in favor 
of a tool "that works."

There is no use to debating the technical merits of \r\n vs \n vs \r vs 
whatever, nor of not converting. Really. Just accept that there is a 
fundamental requirement that any version control tool on Windows be able 
to silently convert between \r\n and \n. To believe otherwise is to 
expect that the conversion be pushed elsewhere into the tool chain in 
use, and that won't happen as the competition already provide this 
conversion capability.

So, I think the git project needs to come to an explicit position on 
this, basically being:

1) git is a POSIX only tool (i.e., there will be no \r\n munging), or
2) a Windows port of git will handle and mung \r\n and \n line endings.

If the answer is 1, the mingw port is a waste of time as it simply won't 
be usable by its target audience. If the answer is 2, then I think a 
very careful design of this capability is in order.

Comments?

BTW, I have addressed this in my own world using a pre-commit script 
that converts textfile line endings into \n, recognizing that our ...
From: Johannes Schindelin
Date: Sunday, February 11, 2007 - 4:34 pm

Hi,


Agree with transformations on input and output; disagree on diff.

The problem is that it really is a transformtion. Since most Windows tools 
(at least those used in portable software) handle \n without \r quite 
well, thank you, I'd tend towards the view point: do not mess with line 
endings pre-commit/post-checkout.

Even MacOSX uses \n now, instead of \r.

Of course, for those projects which _use_ CRLF: they can continue with it. 
Git has no problem with those line endings.

The only problem CVS tried to solve (badly) was to be able to checkout 
text files on DOS, Unix _and_ MacOS. In practice, though, this use case 
does not matter anymore IMHO.

Ciao,
Dscho

-

From: Robin Rosenberg
Date: Sunday, February 11, 2007 - 5:14 pm

From: Mark Levedahl
Date: Sunday, February 11, 2007 - 7:37 pm

So, the basic design for this feature exists where? I would assume this 
would include a file mode indicator set in the blob or tree designating 
the blob is "text", along with mechanism to specify for a project what 
files are "text", along with some safety valve to check and not do 
transformation when the file does not look text-ish.

Mark

-

From: Theodore Tso
Date: Sunday, February 11, 2007 - 9:24 pm

So this is something that I've tried proposing to the Mercurial
developers, but it's never been implemented in hg.  It'll be
interesting to see what the git community thinks.  :-)

My proposal does require adding a file type to each file, as tracked
metadata, which may doom it from the start.  If you add a file type,
then you have to support mutating the file type, and some way of
handling merge conflicts (generally, picking one type or another).

Then for each file type, we implement a set of interfaces (perhaps as
simple as a series of executables named git-<type>-<operation>) which
if present, transforms the file from its live format to the canonical
format which is actually checked in and back again.  Besides using
this for the DOS CR/LF problem, it also allows for an efficient
storage of things like OpenOffice files which are a zipped set of .xml
files.  By decompressing them before pushing them into the SCM, it
means that if the user makes a tiny spelling correction in their
OpenOffice file, the delta stored in the git repository can be much
more efficiently stored (since the diff of the .xml tree will be
small, where as the diff of the entire compressed file is likely going
to be close to the entire size of the .odt file).

Another nice thing to provide for each file type would be a
pretty-printer for the diffs, so it becomes easier to see the delta
between two versions of an OpenOffice file in a textual window.

So, is this idea sane or completely insane?  Hopefully it passes
Linus's it-solves-multiple-problems-at-once test, at least.  :-)

							- Ted
-

From: David Lang
Date: Monday, February 12, 2007 - 12:28 am

there have been other things discussed that could use the 'do this on checkout' 
hooks, specificly on the issue of useing git to manage /etc the need to 
save/restore permissions requires a hook on checkout that doesn't exist yet. 
this sounds like it would solve that problem as well.

David Lang
-

From: Johannes Schindelin
Date: Monday, February 12, 2007 - 4:36 am

Hi,


I'd rather do that a la .gitignore, i.e. make this handling dependent on 
file name patterns. It is not only backwards compatible (from the 
viewpoint of the repository format), it also avoids having to specify over 

Again, I propose a slight change: Let's add a transformation driver like 
the merge driver: this allows inlining common operations like unzipping, 
CRLF->LF conversion, etc.

Ciao,
Dscho

-

From: Linus Torvalds
Date: Monday, February 12, 2007 - 10:20 am

I agree that a file-type approch would work, but I personally think it's 
too inflexible (just cr/lf vs lf? There are tons of other interesting 
issues that are valid). I also think it falls down on another (and in some 
ways much more fundamental problem): these things exist EVEN WHEN THE FILE 
ITSELF DOES NOT EXIST!

In other words, a policy about cr/lf is *not* a policy about actual 
content. It's something much more: it's a policy about representation in 
general, which includes *potential* content. It should obviously take 
effect on "git add" even with content that didn't exist before, and to 
work well, it should do so without the user having to think about it.

Equally importantly, this happens with content that was added by people 
who simply DO NOT CARE. In other words, I think a "file type" thing 
fundamentally cannot work, because under UNIX, it would be stupid and 
pointless, so any project that is maintained under UNIX might _add_ the 
file types, but since they won't matter, they'll inevitably be wrong (ie 
people forgot to mark a binary thing binary, or a text thing as text).

So: file types or attributes are broken. They cannot work well.

But enough on the negative rambling, I do have a positive and constructive 
suggestion, because I actually think I have a great model for it. But I've 
never cared enough (and since the main target would be some windows issue, 
I suspect I never really _will_ care enough) to really worry about it.

Anyway, if somebody really wants to look at this, and wants to create 
something that is actually _usable_, my suggestion is to simply extend on 
the ".gitignore" file approach. The great thing about .gitignore is that

 (a) you can track it like you track any other file

     This makes merges a *lot* easier. You see it as conflicts, you can 
     fix it up, and in general, you can use all the same tools with it as 
     you use with anything else. In contrast, explicit per-file filetypes 
     are _horrible_ for ...
From: Johannes Schindelin
Date: Monday, February 12, 2007 - 3:37 pm

Hi,

[I agree on the .gitignore approach; see my other mail in this thread]


Not so fast.

In order for this to be _useful_, you also have to have a way to _extract_ 
the text blobs. Not only for read-tree, but _also_ for diff. It makes no 
sense at all to have this transformation one-way. For diff, you _might_ 
want to have a diff beautifier (for example the .odt thing), but read-tree 
is _really_ important.

Ciao,
Dscho

-

From: Linus Torvalds
Date: Monday, February 12, 2007 - 4:02 pm

Actually, my argument is that we don't need it all that much.

For example, your "read-tree" argument is actually wrong. Anything that is 
in a tree is _already_ fixed to be '\n'. So as long as we keep to things 
like

	git diff version1..version2

we'll actually always get the right version.

Also, the index will make sure that we don't even *try* to diff normal 
checked out files.

So the only time you actually really need to test the .gitattributes file 
is when you do an "open blob in working tree". And once you do that 
function right, and just make sure both git-update-index and yes, the 
"diff against working tree" cases use it, you really should be mostly 
done.

Both git-update-index and git-diff-files want basically the same 
interface:

	struct file_buf {
		const char *buf;
		unsigned long size;
		int flags;
	}

	int read_file(const char *path, struct file_buf *);
	close_file(struct file_buf *);

and we should use that instead of the current "open + stat + mmap/read + 
close" sequences.

It really shouldn't be too nasty.

		Linus
-

From: Junio C Hamano
Date: Monday, February 12, 2007 - 3:54 pm

I agree that we can assume editors can grok files with LF
end-of-line just fine and we would not need to do the reverse
conversion on checkout paths (e.g. "read-tree -u", "checkout-index").

Textual diff generation needs to learn the CRLF-to-LF conversion
in diff_populate_filespec(); this needs to be done even when the
caller wants size_only.

Oops.


Not me.

-

From: Junio C Hamano
Date: Monday, February 12, 2007 - 4:02 pm

If I were to do this, I would say the cache should store the
size on the filesystem in stat fields.  Which means that the
object name recorded is text blob _after_ line endings are
normalized to LF, and its exploded size does not necessarily
match the cached size.

So this means that whoever does the diff_populate_filespec()
change needs to be careful, but it is not such a big deal.

-

From: Linus Torvalds
Date: Monday, February 12, 2007 - 4:09 pm

Umm. There's two (very distinct) uses for st_size.

The one that we actually use to validate the current index obviously must 
match the "OS returned value". It contains all the CR/LF stuff.

The one where we actually read the file and run SHA1 on the result must 
obviously be the post-conversion one.

But it shouldn't be a problem. We'll always know which one matters: the 
index case is always about pure stat information (and has no meaning 
outside of that, really - after all, it's no different from st_mode etc, 
and we actually keep it in a special binary format that is endian-safe!) 
and the "real object" case is always about the *data* we use to compare 
with.

I don't think we ever mix the two anyway.

		Linus
-

From: Linus Torvalds
Date: Monday, February 12, 2007 - 4:25 pm

In fact, for git-update-index, I think it's *literally* as easy as just 
changing "index_fd()" to convert the buffer on-the-fly as needed, before 
we actually call "write_sha1_file()" or "hash_sha1_file()".

So we'd just need to pass in the information about whether it's binary or 
not, and then do something like

	@@ -2091,6 +2091,10 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object, con
	 
	 	if (!type)
	 		type = blob_type;
	+#ifndef __UNIX__
	+	if (text && !strcmp(type, blob_type))
	+		convert_crlf_to_lf(&buf, &size);
	+#endif
	 	if (write_object)
	 		ret = write_sha1_file(buf, size, type, sha1);
	 	else

and that would take care of a lot of things (yeah, I'd not do it that way 
in practice, but really doesn't look that nasty - it's actually much 
nastier to have to look up the text/binary type in the first place).

Something similar looks to be true in diff generation. The core "compare 
two SHA1's at a time" doesn't need any changes, but the code that actually 
reads in the temporary file from disk obviously does. But even that is 
just _one_ point, afaik - diff_populate_filespec()":

	@@ -1362,6 +1362,10 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
	 		if (fd < 0)
	 			goto err_empty;
	 		s->data = xmmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0);
	+#ifndef __UNIX__
	+		if (text)
	+			convert_crlf_to_lf(&s->data, &s->size);
	+#endif
	 		close(fd);
	 		s->should_munmap = 1;
	 	}

(and again, that's not real code, it would also need to change the 
"should_munmap" flag to indicate the state of the _new_ "data" thing.

		Linus
-

From: David Lang
Date: Monday, February 12, 2007 - 4:23 pm

you could do something like this and it would deal with the srlf/lf problem, but 
if you instead put in the conversion hooks like Ted suggested then you can 
actually gain a LOT more.

his example of openoffice documents that are gziped xml files is a very good 
one. if the 'conversion' is to gunzip on checkin and gzip on checkout then the 
core git logic will work on the nice diffable xml instead of the compressed 
binary blob.

if this is extensable to arbatrary helper functions to do the conversions I'll 
bet that there are many other cases that can use this.

I think the big questions needs to be, is this helper app a filter, or can it be 
passed a filename as the destination (which would let it do things like set 
permissions on the files it creates), or should it be both?

David Lang
-

From: Johannes Schindelin
Date: Monday, February 12, 2007 - 4:24 pm

Hi,


In that case, a simple pre-commit hook would suffice.

No, the problem mentioned by Mark was a very real one: you _cannot_ rely 
on Windows' editors not to fsck up with line endings. The worst case is if 
the file contains _some_ CRLF and _some _LF_. Almost always I had the 
problem that it now converted _all_ LFs to CRLFs. Even those which already 
were converted.

So, if we are to support text mode, it is not one-way. If we do one-way, 
we really do _not_ support text mode, but pre-commit conversion to LF 
style text. And in this case, core git does not need _any_ change.

Ciao,
Dscho

-

From: Junio C Hamano
Date: Monday, February 12, 2007 - 4:42 pm

Well I disagree in two counts.

 - I do not see how you propose to solve some CRLF and some LF
   case with both-ways conversion.

 - Pre-commit hook would not be sufficient.  In a edit, diff,
   test and then commit cycle, diff and test step needs to look
   at whatever the editor left on the filesystem, so the changes
   to populate-filespec is needed to make diff part work.


   

-

From: Johannes Schindelin
Date: Monday, February 12, 2007 - 4:50 pm

Hi,



Yes, you are right.

However, since this is all post-1.5.0 (right? Right?) why not go with more 
of Ted's proposal, and make this whole mess also usable for other things 
than just crlf issues?

And I _really_ think that you do not help Windows people by doing this 
one-way thing.

Ciao,
Dscho

-

From: Mark Levedahl
Date: Monday, February 12, 2007 - 5:59 pm

Whatever is done, it needs to be robust to the notion that people will 
fail to set the correct file type somewhere. Current cvsnt is fairly 
good at autodetecting and setting text vs binary file type, and enforces 
this across all platforms, so things don't go awry too often. It is in 
my experience more reliable than subversion, which basically relies upon 
file extensions mapping to mime types to identify content. All of which 
is a very much too low standard of accuracy for a version control 
system: I lost many files per year due to the above nonsense, so I worry 
about trying to create a very general transform solution and not making 
it really, really failsafe. Having projects define individual globbing 
patterns is good, double checking the content for sanity is an absolute 
must, but I don't think that is enough. I suspect the solution should 
include round-trip conversion when creating blobs to assure that the 
input can be exactly reconstructed by the inverse transformation (and 
therefore possibly rejecting input with mixed line endings). A similar 
check could be applied on checkout.

Perhaps I'm too paranoid, but I've been burnt way too many times by 
text/binary mode stuff to let this part be trivialized. Maybe it only 
gets enabled by core.ImReallyParanoid, but I want that option.

Mark

-

From: Johannes Schindelin
Date: Monday, February 12, 2007 - 6:06 pm

Hi,


Be aware that what you proposed costs many CPU cycles. I am totally 
opposed to enabling that option by default on all platforms. I am okay 
with .gitattributes (but I would call it .gitfiletypes), but I am _not_ 
okay with git being _too much_ fscked up by Windows. Microsoft has done 
enough harm already.

Ciao,
Dscho

-

From: Shawn O. Pearce
Date: Monday, February 12, 2007 - 6:13 pm

Indeed; this type of checking should only occur if there is a filter
applied to a file.  Most files in most projects would hopefully
just be considered to be byte streams to Git, like they are today,
and thus not incur any additional overhead, beyond matching their
type to determine they are in fact just a byte stream.

The type could be cached in the index; or at least a single bit
which says "I'm just a byte stream, thanks" so that the matching
only needs to occur during an initial read-tree.

-- 
Shawn.
-

From: David Lang
Date: Monday, February 12, 2007 - 6:20 pm

for the limited case of line endings it may be reasonable to define the internal 
git format to be lf, and if you are running on a platform that uses this nativly 
no transition is needed

one possible way to make this be a general feture is to have the helper script 
have a --needed flag that tells git if it would do anything on the current 
platform or not. this way you don't need to run it (and sanity check it) if it's 
not needed.

David Lang


-

From: Mark Levedahl
Date: Monday, February 12, 2007 - 6:36 pm

I would assume that none of this crlf stuff exists at all on Linux / 
Unix / Posix, so if done right has zero impact outside of the Windows 
nuthouse. Inside that, folks are already so used to incredible slowness 
in file I/O that I'm not sure the round tripping I suggest as a check 
would be very noticeable, but in any case I fully agree it should be 
optional even there. However, if git could support something that never 
screws up, absolutely guaranteeing data integrity in the presence of 
these transforms, that would be a first in this arena and I believe a 
significant selling point.

Mark

-

From: Jeff King
Date: Monday, February 12, 2007 - 10:18 pm

There is obviously much sentiment that this should _not_ be the default
(and I agree). But if arbitrary filters are possible, then you can
theoretically write an 'autocrlf' filter which will try to do the right
thing, and you could set it for some or all files:

  echo '*: autocrlf' >.gitattributes

but it would be off by default. If we implement this, everyone has to
"pay" for .gitattributes (even if you don't use it, we have to look it
up to make sure you're not using it!), but nobody has to pay for any
filters they don't use.

-Peff
-

From: David Lang
Date: Monday, February 12, 2007 - 4:46 pm

the expectation is that the some-of-each situation is unlikly to happen if you 
convert all the time.

and if you do end up with a mixed ending file, the next time you check it in 
from a windows box it should clean it up.

David Lang
-

From: Mark Levedahl
Date: Monday, February 12, 2007 - 5:32 pm

In my work flow, I am using a pre-commit script that (among other 
things) rewrites all text files to have \n endings. This is a one-way 
conversion, and does work well for the set of tools I am using. The 
converters I use I wrote years ago, and are smart enough to deal with 
mixtures of \n, \r\n, and \r line endings in one file, transforming all 
into one unified form. d2u / u2d were not that robust when I last tried 
them (years ago), but this is an absolute necessity.

However, I don't think the one-way conversion is acceptable across the 
board. While the only Windows editor I am aware of that doesn't grok \n 
is Notepad (the moral equivalent of edlin), I suspect that undo reliance 
upon this will still lead to grief. If nothing else, someone, somewhere 
will find that their beloved crlf's are missing and will complain. 
Loudly. And in the lore, git will become known for being "wierd."  So, I 
suspect a checkout script is necessary.

Mark

-

From: Junio C Hamano
Date: Monday, February 12, 2007 - 7:02 pm

Heh, a lofty goal.  And I am glad to see that a thread full of
constructive suggestion is already going on.

So now I do not have to fear starting a flamewar; I can safely

I think there is a fundamental misconception in the above.  I do
not know about others, but to me personally, I do not see any
"seeking to supplant", nor "competition".  It's not like I or
people who raised git into the current shape are begging to
windows users to consider using git and bending backwards to
please them.  You should hone your diplomacy.

Current git may or may not match what they need, and if it does
not match what they need, making it match what they need is
primarily the responsibility of them.  If Windows users find
something in git that is interesting and useful, but if they
find something else lacking in it to be truly useful for them,
they can submit patches, or if they cannot implement the changes
themselves but only have wishlist items, then _they_ can do the
begging.

People in git community are certainly friendly and helpful
bunch, and some (including me) are unfortunate enough that
sometimes they have to touch Windows, so some degree of need is
felt to support Windows better even within the community, but it
has never been high priority.  Making it higher priority by
bringing in better ideas and starting the fire must come from

I do not think git project needs to do any such thing.  The
project evolves reflecting the needs of its users, and the
design is not decided upfront without doing any feasibility
study.  I would certainly not say our position is (1), IOW, I
would not say we will rule out Windows support.  If it can be
reasonably done without harming the code, why not?

Depending on how cleanly a change Windows users want is done
without negatively affecting the existing users, it may or may
not be judged acceptable.  We will know only when we see at
least the design and preferably the code.  I feel no need to

This is not just you, and fortunately it does not ...
From: Mark Levedahl
Date: Monday, February 12, 2007 - 8:21 pm

Junio C Hamano wrote:
 > Mark Levedahl <mlevedahl@verizon.net> writes:
 >
 >> I am NOT intending to start a flamewar O:-) , so please don't turn
 >> this into one.
 >
 > Heh, a lofty goal.  And I am glad to see that a thread full of
 > constructive suggestion is already going on.
 >
 > So now I do not have to fear starting a flamewar; I can safely
 > vent.

Junio,

I meant absolutely no offense in anything I wrote, and sincerely 
apologize if any was taken. My past experiences caused me to be 
skeptical that a significant change to accommodate a very bad design of 
Windows would be accepted here. Happily, that skepticism was misplaced. 
I am much heartened by the responses, and am optimistic a good solution 
will be found that is acceptable to all. It is very clear that the group 
is open and supportive of working through the issues to help this, and I 
intend to contribute to that solution. (If nothing else, I would like to 
be known for something besides some modest ability to hack around Tk bugs).

So, I trust the flamethrowers can remain buried.

Mark

-

From: Junio C Hamano
Date: Monday, February 12, 2007 - 11:05 pm

None taken, although I admit that I was somewhat annoyed, having
to write the first part of my response.

-

From: Alexander Litvinov
Date: Monday, February 12, 2007 - 8:32 pm

I am strongly object this statement. I develop one project under Windows and 
use Cygwin git for this. Yes, I have a problem with git's thinking line 
ending is a \n but most of troubles are diff and rebase. In general git works 
well with \r\n line endings.

When I have file that was converted from dos to unix format (or from unix to 
dos) git genereta big diff. But anyway, c++ compiler works well with both 
formats and in this case I simply convert file to dos format and git shows 
again nice diff. If unix format was commited to git I simply change the 
format and commit that file again.

The only trouble is the rebase, it does not like \r\n ending and othen produce 
unexpected merge conflict. But I don't use rebse to othen to realy 
investigate and try to solve the problem.
-

From: Johannes Schindelin
Date: Tuesday, February 13, 2007 - 3:06 am

Hi,



Well, if everybody thinks like you, maybe we do not have to change 
anything for Windows after all?

Ciao,
Dscho

-

From: Alexander Litvinov
Date: Tuesday, February 13, 2007 - 5:16 am

If you are tring to build history that looks good - you are right this is a 
I still wish to have working rebase so if git will hanle somehow \r\n it would 
be nice. But please do not produce the same behavior as cvs does: under 
cygwin it still use \n !

By the way, most windows programmers I work with says 'git is cool but is 
there gui like tortoise or wincvs ?' :-)
-

From: Johannes Schindelin
Date: Tuesday, February 13, 2007 - 5:37 am

Hi,


You really should teach format-patch to output \n patches, and keep all 

Some time ago, I started playing with a shell extension. Now that MinGW 
git is almost there, I might clean it up... Would you be interested in 
working on it, or is this just wishtalk?

Ciao,
Dscho

-

From: Linus Torvalds
Date: Tuesday, February 13, 2007 - 9:52 am

No no no.

It's going to be _horrible_ if people start interesting projects in 
Windows, and there are files in a git repository that are encoded with 
CRLF. 

I'd much rather just get this right, and that means "no hooks". If people 
start using commit hooks etc, that will just mean that they won't use them 
for all-windows environments (why use it? Everybody hass CRLF, and 
everybody _wants_ CRLF), or it will just be relatively expensive to have a 
complex hook anyway.

So I think we should plan on something like .gitattributes or similar, so 
that we _can_ handle mixed environments well, without any real setup or 
any real costs.

The costs really shouldn't be too high - we tend to avoid doing any 
expensive working tree changes *anyway*. For example, even "git checkout" 
has a huge optimization to avoid rewriting files that are already ok, so 
doing things like switching whole branches usually wouldn't even need any 
conversion for most files - even on platforms like Windows that need the 
conversion in the first place.

So considering that it looks _trivial_ for git-update-index, fairly easy 
for diff generation, and I doubt "git checkout" is really likely to be any 
worse either, this should just be somethign we do.

The *ONLY* case where we may not be able to do things automatically is 
actually a much more subtle one: "git cat-file". If we just get a SHA1, we 
don't know what the path to look it up was like, and thus we can never 
know whether it's a binary or a text object. With "-p" we can trivially 
guess, of course, but "git cat-file blob" simply must not do that!

But that really doesn't sound like a big problem to me ;)

		Linus
-

From: Linus Torvalds
Date: Tuesday, February 13, 2007 - 10:23 am

Here's a patch that I think we can merge right now. There may be other 
places that need this, but this at least points out the three places that 
read/write working tree files for git update-index, checkout and diff 
respectively. That should cover a lot of it.

Some day we can actually implement it. In the meantime, this points out a 
place for people to start. We *can* even start with a really simple "we do 
CRLF conversion automatically, regardless of filename" kind of approach, 
that just look at the data (all three cases have the _full_ file data 
already in memory) and says "ok, this is text, so let's convert to/from 
DOS format directly".

THAT somebody can write in ten minutes, and it would already make git much 
nicer on a DOS/Windows platform, I suspect.

And it would be totally zero-cost if you just make it a config option 
(but please make it dynamic with the _default_ just being 0/1 depending 
on whether it's UNIX/Windows, just so that UNIX people can _test_ it 
easily).

		Linus
-

From: Linus Torvalds
Date: Tuesday, February 13, 2007 - 10:23 am

No. HERE's the trivial stupid patch that just marks the core places.

		Linus
---
diff --git a/diff.c b/diff.c
index aaab309..13b9b6c 100644
--- a/diff.c
+++ b/diff.c
@@ -1364,6 +1364,7 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
 		s->data = xmmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0);
 		close(fd);
 		s->should_munmap = 1;
+		/* FIXME! CRLF -> LF conversion goes here, based on "s->path" */
 	}
 	else {
 		char type[20];
diff --git a/entry.c b/entry.c
index 0ebf0f0..c2641dd 100644
--- a/entry.c
+++ b/entry.c
@@ -89,6 +89,7 @@ static int write_entry(struct cache_entry *ce, char *path, struct checkout *stat
 			return error("git-checkout-index: unable to create file %s (%s)",
 				path, strerror(errno));
 		}
+		/* FIXME: LF -> CRLF conversion goes here, based on "ce->name" */
 		wrote = write_in_full(fd, new, size);
 		close(fd);
 		free(new);
diff --git a/sha1_file.c b/sha1_file.c
index 0d4bf80..8ad7fad 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2091,6 +2091,7 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object, con
 
 	if (!type)
 		type = blob_type;
+	/* FIXME: CRLF -> LF conversion here for blobs! We'll need the path! */
 	if (write_object)
 		ret = write_sha1_file(buf, size, type, sha1);
 	else
-

From: Junio C Hamano
Date: Tuesday, February 13, 2007 - 11:00 am

Thanks, applied.  I think git-apply has separate codepaths for
both reading and writing; I won't look into them before 1.5.0
but people are welcome to help advancing the cause before I get
to it ;-).

-

From: Linus Torvalds
Date: Tuesday, February 13, 2007 - 12:07 pm

Actually, I did it myself.

This is a "lazy man's auto-CRLF", and it really is pretty simple.

It currently does NOT know about file attributes, so it does its 
conversion purely based on content. Maybe that is more in the "git 
philosophy" anyway, since content is king, but I think we should try to do 
the file attributes to turn it off on demand.

Anyway, BY DEFAULT it is off regardless, because it requires a

	[core]
		AutoCRLF = true

in your config file to be enabled. We could make that the default for 
Windows, of course, the same way we do some other things (filemode etc).

But you can actually enable it on UNIX, and it will cause:

 - "git update-index" will write blobs without CRLF
 - "git diff" will diff working tree files without CRLF
 - "git checkout" will write files to the working tree _with_ CRLF

and things work fine.

Funnily, it actually shows an odd file in git itself:

	git clone -n git test-crlf
	cd test-crlf
	git config core.autocrlf true
	git checkout
	git diff

shows a diff for "Documentation/docbook-xsl.css". Why? Because we have 
actually checked in that file *with* CRLF! So when "core.autocrlf" is 
true, we'll always generate a *different* hash for it in the index, 
because the index hash will be for the content _without_ CRLF.

Is this complete? I dunno. It seems to work for me. It doesn't use the 
filename at all right now, and that's probably a deficiency (we could 
certainly make the "is_binary()" heuristics also take standard filename 
heuristics into account).

I don't pass in the filename at all for the "index_fd()" case 
(git-update-index), so that would need to be passed around, but this 
actually works fine.

NOTE NOTE NOTE! The "is_binary()" heuristics are totally made-up by yours 
truly. I will not guarantee that they work at all reasonable. Caveat 
emptor. But it _is_ simple, and it _is_ safe, since it's all off by 
default.

The patch is pretty simple - the biggest part is the new "convert.c" file, 
but even ...
From: Sam Ravnborg
Date: Tuesday, February 13, 2007 - 1:42 pm

This whole auto CRLF things seems to deal with DOS issues that I personally
have not encountered since looong time ago.
Granted notepad in Windows does not understand UNIX files but that a bug
in notepad and everyone knows that wordpad can be used.

I wonder what we are really trying to address here. Or in other words
could the original poster maybe tell what Windows IDE's that does
not handle UNIX files properly?

core git today should not care about CRLF as opposed to LF end-of-line
as long as the end-of-line is consistent - correct?

So defaulting to autoCRLF in Windows/DOS environments was maybe
sane 10 years ago but today that seems to be the wrong thing to do.
For certain project the option could be useful if the tool-set in
the project *requires* CRLF, but if the toolset like all modern toolset
supports both CRLF and LF then git better avoid changing end-of-line marker.

	Sam
-

From: Nicolas Pitre
Date: Tuesday, February 13, 2007 - 2:08 pm

Maybe you didn't share a work environment with Windows users since 

Windows IDE's can _create_files.  Those files will be CRLF infected.

Also some of them read UNIX files just fine but they will use CRLF to 


Rather git better enforce consistency otherwise it'll be only a mix of 
possible combination as soon as Windows and UNIX users work on the same 
project.


Nicolas
-

From: Linus Torvalds
Date: Tuesday, February 13, 2007 - 4:28 pm

Maybe you stopped using DOS a loong time ago ;)

It's definitely an issue. Yes, all windows programs basically *understand* 
files that have just LF. But almost all of them will *write* files with 
CRLF.

(Which means that I suspect I made the default for "auto_crlf" be wrong in 
my patch: I probably should not default to checking out with CRLF, but 
checking out with just LF, and only do the CRLF->LF conversion on input).

Anybody who has ever worked with _any_ Windows people have long since 
learnt that they always end up having to convert CRLF to just LF when they
get files. Even _I_ know it, and I seldom have to work with people who use 
Windows ;)

So it's a good idea to try to make sure that Windows users don't corrupt 
files by adding CRLF where there is no need for them into a git archive. 
We hope to convert those people to a real OS some day ("here's a nickel, 
boy"), and to make it easier for them to do it, making sure that their 
projects in -git are already in a sane format is probably a good idea.

			Linus
-

From: Sam Ravnborg
Date: Wednesday, February 14, 2007 - 1:41 am

So the issue with git supporting CRLF -> LF is to make interoperability between
UNIX* programs and Windows programs which is anohter domain.

My main objective is the proposal to make a conversion default when many users
do not need it. For the UNIX* compatibility thing having conversion at lowest
Expect that it seems a few br0ken programs yet does not support LF as
end-of-line marker - so .gitattriutes make take special care here.

	Sam
-

From: Linus Torvalds
Date: Wednesday, February 14, 2007 - 9:28 am

Yes, but I also think that even without .gitattributes, you just want to 
have a default for what "text" actually means, and it's entirely possible 
that the default should be: "check out with just LF, and on check-in turn 
CRLF into LF".

But exactly because _some_ programs might want to always see CRLF on input 
too, it should be overridable. 

Or maybe the default should be "turn into CRLF", and there should just be 
an option to make it check out as LF-only.

Regardless, I think that is independent of ".gitattributes". The 
_attribute_ should be "text", but what it then means in practice is a 
separate flag.

And yes, we *could* have a per-file attribute ("text,crlf-checkout") which 
could be used to say "I want to always check out as crlf regardless of any 
other policy") and the same for lf-only, but I seriously doubt that 
anybody really needs that kind of knob-tweaking. At some point it's just 
fine to say "you're crazy".

			Linus
-

From: Sam Ravnborg
Date: Wednesday, February 14, 2007 - 9:47 am

The definition of what is "text" and what action to take upon check-in /
check-out of text is two sepearate things.

I could see it as beneficial as a per-project or even as an overall
git-policy to say "checkin-as-LF" - "checkout-as-LF" to overcome
Which is where I see .gitattributes come into play.
-> A rule that says files with extension .prj and of type "text" shall not see
any conversion.

In this way almost all "text" over time get a proper format and the remaining
brain-dead tools that continue to save in CRLF format will not destroy the sane
LF format.

If anything gets defualt I would vote for LF. But overrideable.

My editor-of-choice does eol auto-sense. If I recall correct it scans the
first 200 lines and counts number of CR,LF,CRLF and based on this judge the
actual eol character used. But not all editors are that sensible :-(

	Sam
-

From: David Lang
Date: Tuesday, February 13, 2007 - 4:19 pm

I've actually run into grief on this subject with perl scripts within the last 
year (files from windows systems with crlf not working cleanly on a linux system 
with just lf)

this is real, not just historic

David Lang
-

From: Alexander Litvinov
Date: Tuesday, February 13, 2007 - 8:47 pm

MS VC has text file for project file but don't like \n line endings, only 
\r\n.
-

From: Junio C Hamano
Date: Tuesday, February 13, 2007 - 10:16 pm

It might be safe for some definition of safe, but it is very
Asian unfriendly.

I'd probably suggest replacing it with what GNU diff uses, which
we stolen and implemented in diff.c::mmfile_is_binary().

-

From: Linus Torvalds
Date: Tuesday, February 13, 2007 - 10:36 pm

Well, the thing is, mmfile_is_binary() doesn't really have a big downside 
if it's wrong one way or the other.

In contrast CR->CRLF conversion, if wrong, actually corrupts binary files. 
So I felt it was better to be really safe than sorry. It's *much* better 
to miss some CRLF translation than to do too much of it.

That said, I'm sure it could be improved a lot. In particular, characters 
in the range 0x00 - 0x1f are clearly "more binary" than the 0x7f+ range, 
with the obvious exceptions (tab, cr, lf).

0x00 - which is the only one mmfile_is_binart() uses - is arguably the 
"most binary" one, of course, but it might be interesting to give 
different weights to the whole range.. In particular, especially for small 
files, the fact that there is no 0x00 byte in no way indicates that it's 
not "binary".

This whole issue is obviously one reason I'd like to involve the filename 
itself, and make it use a ".gitattributes" file - exactly because that 
allows you to be much more aggressive and more precise.

(0x00 may be one of the more _common_ characters in many binary files, 
which makes it a good character to search for too, so I don't really have 
any hugely strong opinions here. After all, the whole heuristic is off by 
default anyway, so it's "really safe" ;^)

			Linus


-

From: Johannes Schindelin
Date: Wednesday, February 14, 2007 - 4:10 am

Hi,


Last time I checked, the text files never had lines longer than 200 
characters (I chose this intentionally large). So, it might be a good 
heuristic to check the maximal line length, and refuse to believe that 
it's text once a certain (configurable) threshold is reached.

Ciao,
Dscho

-

From: Mark Levedahl
Date: Wednesday, February 14, 2007 - 7:26 am

Unfortunately, on my program we have folks using text files with single 
lines over 60,000 characters long, these are data files. Think for 
example of a comma or tab separated data file saved from a spreadsheet. 
In this case, the files are pure ascii. So, the line length could be 
something else to take into account, but is not decisive by itself.

To recap, we have the following various suggestions to determine textness:

1) ratio of ascii to non-ascii characters, possibly weighting some chars 
more than others
2) line length
3) existence of a null (\0)
4) file name globbing
5) roundtrip ( lf(crlf(file) ) == file

I don't think any one suggestion is completely adequate for all uses, 
all need to be available, somehow configurable. This suggests to me a 
core.AutoCRLFstrategy variable that is a comma separated list of methods 
to use (set to a reasonable default of course that does not cause 
runtime headaches on Unix): a file would be deemed binary unless all 
listed methods declare the file as text (with an empty list disabling 
AutoCRLF detection).

Mark

-

From: Linus Torvalds
Date: Wednesday, February 14, 2007 - 8:51 am

Actually, my patch already had one that you didn't mention: 
 6) CR never shows up alone.

So the patch I sent out basicallyhad the following rules:
 - no more than ~10% of all characters being other than regular printable 
   ASCII (where any control character except for newline/cr/tab was deemed 
   nonprintable)
 - any "lonely" CR automatically means it's binary, and I would refuse 
   to convert that to a LF (the test in the code is that CRLF count must 
   match CR count)

but the "roundtrip" rule is much too strict (it's actually perfectly 
possible for an editor to add CRLF characters only to new _lines_, leaving 
old lines with just LF - or the other way around. In fact, the editor I 
use under Linux does exactly that in reverse - if I add new lines, it will 
add those without CR, but will leave old lines with CRLF alone).

I think that to help asian languages (or strange text-files in utf8 or 
Latin1 too, for that matter: test-files with _just_ special characters), I 
should probably make the rule be that only the 0-31 range is special.

			Linus
-

From: Junio C Hamano
Date: Wednesday, February 14, 2007 - 9:39 am

I would agree.  0-31 except HT, CR, LF and ESC would be a good
idea; that would not harm text in UTF-8, EUC based various
locales nor ISO 2022.

Patch is relative to 'pu'.
-- >8 --

diff --git a/convert.c b/convert.c
index ebcf717..b6b7c66 100644
--- a/convert.c
+++ b/convert.c
@@ -13,7 +13,7 @@ struct text_stat {
 	unsigned cr, lf, crlf;
 
 	/* These are just approximations! */
-	unsigned printable, nonprintable, nul;
+	unsigned printable, nonprintable;
 };
 
 static void gather_stats(const char *buf, unsigned long size, struct text_stat *stats)
@@ -34,13 +34,11 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 			stats->lf++;
 			continue;
 		}
-		if (c == '\t' || (c >= 32 && c < 127)) {
-			stats->printable++;
+		if ((c < 32) && (c != '\t' && c != '\033')) {
+			stats->nonprintable++;
 			continue;
 		}
-		if (!c)
-			stats->nul++;
-		stats->nonprintable++;
+		stats->printable++;
 	}
 }
 
@@ -50,7 +48,7 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 static int is_binary(unsigned long size, struct text_stat *stats)
 {
 
-	if (stats->nul)
+	if (stats->nonprintable)
 		return 1;
 	/*
 	 * Other heuristics? Average line length might be relevant,

-

From: Linus Torvalds
Date: Wednesday, February 14, 2007 - 10:01 am

Yeah, I think we can ignore them..


You could possibly add 127 to the list too (it's ascii DEL, I don't know 

But this is too harsh.

It's quite common to have the occasional FF character. Some things really 
do use it for page breaks. So saying that *any* nonprintable character is 
bad is not a good idea.

Same goes for BS (some programs use it to show bold and underlined text: 
man-pages, for example).

		Linus
-

From: Junio C Hamano
Date: Wednesday, February 14, 2007 - 10:29 am

Ok.  How about adding BS and FF to the Ok set, and checking if
bad ones are less than 1% of the good ones?

diff --git a/convert.c b/convert.c
index b6b7c66..b0c7641 100644
--- a/convert.c
+++ b/convert.c
@@ -34,11 +34,22 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 			stats->lf++;
 			continue;
 		}
-		if ((c < 32) && (c != '\t' && c != '\033')) {
+		if (c == 127)
+			/* DEL */
 			stats->nonprintable++;
-			continue;
+		else if (c < 32) {
+			switch (c) {
+				/* BS, HT, ESC and FF */
+			case '\b': case '\t': case '\033': case '\014':
+				stats->printable++;
+				break;
+			default:
+				stats->nonprintable++;
+			}
+			
 		}
-		stats->printable++;
+		else
+			stats->printable++;
 	}
 }
 
@@ -48,7 +59,7 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 static int is_binary(unsigned long size, struct text_stat *stats)
 {
 
-	if (stats->nonprintable)
+	if ((stats->printable >> 7) < stats->nonprintable)
 		return 1;
 	/*
 	 * Other heuristics? Average line length might be relevant,

-

From: Linus Torvalds
Date: Wednesday, February 14, 2007 - 10:43 am

I think that looks fine.

		Linus
-

From: Johannes Schindelin
Date: Wednesday, February 14, 2007 - 8:56 am

Hi,


This sounds regretfully complex. Somebody (you?) mentioned that cvsnt does 
a kick-ass job here. Does cvsnt need strategies? I don't think so. Neither 
do we. Someone who cares enough should just rip^H^H^Hlook at cvsnt's text 
detection.

Ciao,
Dscho

-

From: Linus Torvalds
Date: Wednesday, February 14, 2007 - 9:23 am

Well, one thing to keep in mind is that for source code in particular, 
this really very seldom is an issue.

So you can do a really *bad* job in theory, and in practice it really 
works very very well.

Very few people keep binary blobs in any SCM archive _anyway_, partly 
because they've always been told that it's unsafe (and with a lot of SCM's 
it is), but even more because binary blobs are almost always generated by 
some build method, so normally you'd never version them in the first 
place, or versioning isn't all that helpful.

And most binary blobs are so *obviously* binary that even the stupidest 
algorithm on earth will get it right. The only hard cases actually tend to 
be really tiny files, or literally test-sequences.

Tiny files are hard because:

 - they (by being tiny) have so few characters that they can easily lack 
   a "fingerprint" character (eg a NUL character or similar). 

 - tiny files are a lot more likely than bigger files to have strange 
   statistics that throw some more "sophisticated" rule off the scent. 
   Something like a "10% rule" tends to work fine if you have a big text, 
   and ten percent is still a reasonable number to average things out 
   over, but what if you only had ten characters to begin with?

The good news is that tiny files can usually be considered text, since 
you'd seldom use a binary format for something really small anyway.

So I suspect that IN PRACTICE, especially if you come as a CVS replacement 
(where binary files are just damn hard to get right even under the best of 
circumstances!), you can do just about anything, including just saying 
"everything is text", and you'd be fine.

It's entirely possible that that is exactly what CVSNT does ;)

		Linus
-

From: Mark Levedahl
Date: Wednesday, February 14, 2007 - 10:28 am

I agree that is complex, I started thinking of PAM when I wrote that, 
leading to, "this aint gonna work." But in the modern day let's all feel 
good spirit of "there are no stupid ideas, just some are better" I threw 
it out anyway.

As to cvsnt, my actual feeling is I'd like to kick it in the ass, it has 
destroyed too many files for me over the years, binary and text, so I 
don't think its strategies are very good. That is why I'm kicking these 
ideas around, if I thought I knew the "right" way I would have written 
it already.

Mark

-

From: Robin Rosenberg
Date: Wednesday, February 14, 2007 - 11:17 am

That may be why an excellent piece of software, TortoiseCVS,  doesn't trust 
cvs or cvsnt to do the job. Here is how they do the binary detection (and 
some more):

http://tortoisecvs.cvs.sourceforge.net/tortoisecvs/TortoiseCVS/src/CVSGlue/CVSStatus.c...

-- robin
-

From: Linus Torvalds
Date: Wednesday, February 14, 2007 - 11:31 am

Well, it does seem to boil down to what Junio already got to:

 - 0-31 and 127 are never in text, except for BEL, BS, HT, LF, FF, CR and 
   ESC.
 - 128-255 can all be in either iso-8859 or extended ascii (or they 
   explicitly add NEL but not 128+27 to "normal ASCII", which is strange)

So they've effectively added BEL and ESC to the listof characters that 
Junio has now. But they also make it an absolute error to have anything 
else (no "1% rule").

But they also do the filename tests, and I think that's more important in 
many ways.

		Linus
-

From: Robin Rosenberg
Date: Wednesday, February 14, 2007 - 1:24 pm

Especially ESC used to be common in DOS/Windows and quite a few hang around in
Can this 1%-rule be motivated from real cases, rather that hypotetical ones? It makes 

A unixy tool like git should maybe use magic too :).

Btw the filename (like .gitignore or similar) test in practice would give us 
 the binary flag. Just list a filename instead of a pattern.

-- robin
-

From: Linus Torvalds
Date: Wednesday, February 14, 2007 - 8:44 am

No, some broken editor programs and people use "flowing text" files, where 
a newline is actually a _paragraph_ end. You have lines in the hundreds 
(and thousands) of characters, and the program will just flow the text for 
you.

Ugh. Horrible, I know. 

		Linus
-

From: Johannes Schindelin
Date: Wednesday, February 14, 2007 - 8:53 am

Hi,


Good point.

Ciao,
Dscho

-

From: Alexander Litvinov
Date: Wednesday, February 14, 2007 - 4:36 am

Wow ! Thanks. 

I just tried this patch and it works! From now I can use git-cvsimport under 
Linux and then clone it to cygwin and work there with full history. Nice, 
very nice. In my case text file detection work well as far most of our files 
are .cpp and .h
-

From: Linus Torvalds
Date: Wednesday, February 14, 2007 - 9:37 am

Btw, it didn't do any commit message conversion etc, so you'll still 
always see commit messages with LF-only, and if you _create_ commits, you 

Yeah, considering that it worked in my testing for "git" itself, I'm not 
surprised. Source code tends to look the same..

		Linus
-

From: Junio C Hamano
Date: Wednesday, February 14, 2007 - 10:18 am

I think stripspace removes CR so we should be Ok.

-

From: Johannes Schindelin
Date: Tuesday, February 13, 2007 - 11:05 am

Hi,


Why the haste all of a sudden? Your patch is easily applyable for anyone 
who wants to work on text/binary or arbitrary file types. No need to rush 
a developers-only patch into git.git.

Ciao,
Dscho

-

From: Nicolas Pitre
Date: Tuesday, February 13, 2007 - 10:25 am

git-cat-file, and its counter part git-hash-object, are fairly low level 
plumbing.  Anyone using them should be aware of the issue and apply the 
needed conversion.  And actually, since we're going to have the 
conversion routines in the core, we'd only need to add a --crlf argument 
to both of them to optionally perform the conversion since the user of 
those commands is more likely to know if the conversion is needed.


Nicolas
-

From: Johannes Schindelin
Date: Tuesday, February 13, 2007 - 11:04 am

Hi,


No hooks means something like cvsnt does, and that means no .gitattributes 
either. (BTW I really hate .gitattributes, as it does not at all say what 
this is about; it's about file _conversions_, not attributes).

CVSNT analyzes the files, and guesses if they are text, and only then 
activates the text mode.

I am strongly opposed to including something like that. (It was already 
proposed, and your "no hooks" suggests the same.)

However, I am slightly positive about the .gitfiletypes approach, _iff_ we 
think about more than just text/binary from the start. If we do it right, 
it will buy us more.

Ciao,
Dscho

-

From: Junio C Hamano
Date: Tuesday, February 13, 2007 - 11:11 am

We might start with only binary/text attributes, but we may add
more later, e.g. chmod=o-rwx.  I do not see much differnece
between attributes vs filetypes.


-

From: Linus Torvalds
Date: Tuesday, February 13, 2007 - 11:39 am

No, it *is* about attributes.

In order to know how to convert, you need to know the attributes of the 
file.

So it's not about conversion: we would ALWAYS do conversion. It's about 
the fact that in order to do the conversion, we need to know what the 
attributes of the file is - is it text, or what.

And the equal point is that there are _other_ attributes that git might 
care about. The "merge strategy" attribute, for example. Or "owner" 
attributes for files etc.

		Linus
-

From: Johannes Schindelin
Date: Tuesday, February 13, 2007 - 11:42 am

Hi,


Yes, you're right. Colour me converted (pun intended).

Ciao,
Dscho

-

Previous thread: Efficiency of initial clone from server by Jon Smirl on Sunday, February 11, 2007 - 12:53 pm. (31 messages)

Next thread: [RFC] Speeding up a null fetch by Julian Phillips on Sunday, February 11, 2007 - 4:32 pm. (5 messages)