I am NOT intending to start a flamewar O:-) , so please don't turn this into one. The recent threads on a mingw git port are explicit in the intent to provide a Windows native git. I believe there is a fundamental conflict here with the position, clearly stated by Linus, that git does not alter content in any way. Windows suffers the curse of DOS line endings (\r\n vs \n), and a true port to Windows *must* allow for \r\n and \n to be semantically the same thing as most large projects end up with a mixture of such files and/or are targeting cross-platform capabilities. The major competing solutions git seeks to supplant (cvs, cvsnt, svn, hg) have capability to recognize "text" files and transparently replace \r\n with \n on input, the reverse on output, and ignore all such differences on diff operations. To be relevant on native Windows, git must do the same. Otherwise, git will be deemed "too wierd" and dismissed in favor of a tool "that works." There is no use to debating the technical merits of \r\n vs \n vs \r vs whatever, nor of not converting. Really. Just accept that there is a fundamental requirement that any version control tool on Windows be able to silently convert between \r\n and \n. To believe otherwise is to expect that the conversion be pushed elsewhere into the tool chain in use, and that won't happen as the competition already provide this conversion capability. So, I think the git project needs to come to an explicit position on this, basically being: 1) git is a POSIX only tool (i.e., there will be no \r\n munging), or 2) a Windows port of git will handle and mung \r\n and \n line endings. If the answer is 1, the mingw port is a waste of time as it simply won't be usable by its target audience. If the answer is 2, then I think a very careful design of this capability is in order. Comments? BTW, I have addressed this in my own world using a pre-commit script that converts textfile line endings into \n, recognizing that our ...
Hi, Agree with transformations on input and output; disagree on diff. The problem is that it really is a transformtion. Since most Windows tools (at least those used in portable software) handle \n without \r quite well, thank you, I'd tend towards the view point: do not mess with line endings pre-commit/post-checkout. Even MacOSX uses \n now, instead of \r. Of course, for those projects which _use_ CRLF: they can continue with it. Git has no problem with those line endings. The only problem CVS tried to solve (badly) was to be able to checkout text files on DOS, Unix _and_ MacOS. In practice, though, this use case does not matter anymore IMHO. Ciao, Dscho -
m
So, the basic design for this feature exists where? I would assume this would include a file mode indicator set in the blob or tree designating the blob is "text", along with mechanism to specify for a project what files are "text", along with some safety valve to check and not do transformation when the file does not look text-ish. Mark -
So this is something that I've tried proposing to the Mercurial developers, but it's never been implemented in hg. It'll be interesting to see what the git community thinks. :-) My proposal does require adding a file type to each file, as tracked metadata, which may doom it from the start. If you add a file type, then you have to support mutating the file type, and some way of handling merge conflicts (generally, picking one type or another). Then for each file type, we implement a set of interfaces (perhaps as simple as a series of executables named git-<type>-<operation>) which if present, transforms the file from its live format to the canonical format which is actually checked in and back again. Besides using this for the DOS CR/LF problem, it also allows for an efficient storage of things like OpenOffice files which are a zipped set of .xml files. By decompressing them before pushing them into the SCM, it means that if the user makes a tiny spelling correction in their OpenOffice file, the delta stored in the git repository can be much more efficiently stored (since the diff of the .xml tree will be small, where as the diff of the entire compressed file is likely going to be close to the entire size of the .odt file). Another nice thing to provide for each file type would be a pretty-printer for the diffs, so it becomes easier to see the delta between two versions of an OpenOffice file in a textual window. So, is this idea sane or completely insane? Hopefully it passes Linus's it-solves-multiple-problems-at-once test, at least. :-) - Ted -
there have been other things discussed that could use the 'do this on checkout' hooks, specificly on the issue of useing git to manage /etc the need to save/restore permissions requires a hook on checkout that doesn't exist yet. this sounds like it would solve that problem as well. David Lang -
Hi, I'd rather do that a la .gitignore, i.e. make this handling dependent on file name patterns. It is not only backwards compatible (from the viewpoint of the repository format), it also avoids having to specify over Again, I propose a slight change: Let's add a transformation driver like the merge driver: this allows inlining common operations like unzipping, CRLF->LF conversion, etc. Ciao, Dscho -
I agree that a file-type approch would work, but I personally think it's
too inflexible (just cr/lf vs lf? There are tons of other interesting
issues that are valid). I also think it falls down on another (and in some
ways much more fundamental problem): these things exist EVEN WHEN THE FILE
ITSELF DOES NOT EXIST!
In other words, a policy about cr/lf is *not* a policy about actual
content. It's something much more: it's a policy about representation in
general, which includes *potential* content. It should obviously take
effect on "git add" even with content that didn't exist before, and to
work well, it should do so without the user having to think about it.
Equally importantly, this happens with content that was added by people
who simply DO NOT CARE. In other words, I think a "file type" thing
fundamentally cannot work, because under UNIX, it would be stupid and
pointless, so any project that is maintained under UNIX might _add_ the
file types, but since they won't matter, they'll inevitably be wrong (ie
people forgot to mark a binary thing binary, or a text thing as text).
So: file types or attributes are broken. They cannot work well.
But enough on the negative rambling, I do have a positive and constructive
suggestion, because I actually think I have a great model for it. But I've
never cared enough (and since the main target would be some windows issue,
I suspect I never really _will_ care enough) to really worry about it.
Anyway, if somebody really wants to look at this, and wants to create
something that is actually _usable_, my suggestion is to simply extend on
the ".gitignore" file approach. The great thing about .gitignore is that
(a) you can track it like you track any other file
This makes merges a *lot* easier. You see it as conflicts, you can
fix it up, and in general, you can use all the same tools with it as
you use with anything else. In contrast, explicit per-file filetypes
are _horrible_ for ...Hi, [I agree on the .gitignore approach; see my other mail in this thread] Not so fast. In order for this to be _useful_, you also have to have a way to _extract_ the text blobs. Not only for read-tree, but _also_ for diff. It makes no sense at all to have this transformation one-way. For diff, you _might_ want to have a diff beautifier (for example the .odt thing), but read-tree is _really_ important. Ciao, Dscho -
Actually, my argument is that we don't need it all that much.
For example, your "read-tree" argument is actually wrong. Anything that is
in a tree is _already_ fixed to be '\n'. So as long as we keep to things
like
git diff version1..version2
we'll actually always get the right version.
Also, the index will make sure that we don't even *try* to diff normal
checked out files.
So the only time you actually really need to test the .gitattributes file
is when you do an "open blob in working tree". And once you do that
function right, and just make sure both git-update-index and yes, the
"diff against working tree" cases use it, you really should be mostly
done.
Both git-update-index and git-diff-files want basically the same
interface:
struct file_buf {
const char *buf;
unsigned long size;
int flags;
}
int read_file(const char *path, struct file_buf *);
close_file(struct file_buf *);
and we should use that instead of the current "open + stat + mmap/read +
close" sequences.
It really shouldn't be too nasty.
Linus
-
I agree that we can assume editors can grok files with LF end-of-line just fine and we would not need to do the reverse conversion on checkout paths (e.g. "read-tree -u", "checkout-index"). Textual diff generation needs to learn the CRLF-to-LF conversion in diff_populate_filespec(); this needs to be done even when the caller wants size_only. Oops. Not me. -
If I were to do this, I would say the cache should store the size on the filesystem in stat fields. Which means that the object name recorded is text blob _after_ line endings are normalized to LF, and its exploded size does not necessarily match the cached size. So this means that whoever does the diff_populate_filespec() change needs to be careful, but it is not such a big deal. -
Umm. There's two (very distinct) uses for st_size. The one that we actually use to validate the current index obviously must match the "OS returned value". It contains all the CR/LF stuff. The one where we actually read the file and run SHA1 on the result must obviously be the post-conversion one. But it shouldn't be a problem. We'll always know which one matters: the index case is always about pure stat information (and has no meaning outside of that, really - after all, it's no different from st_mode etc, and we actually keep it in a special binary format that is endian-safe!) and the "real object" case is always about the *data* we use to compare with. I don't think we ever mix the two anyway. Linus -
In fact, for git-update-index, I think it's *literally* as easy as just changing "index_fd()" to convert the buffer on-the-fly as needed, before we actually call "write_sha1_file()" or "hash_sha1_file()". So we'd just need to pass in the information about whether it's binary or not, and then do something like @@ -2091,6 +2091,10 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object, con if (!type) type = blob_type; +#ifndef __UNIX__ + if (text && !strcmp(type, blob_type)) + convert_crlf_to_lf(&buf, &size); +#endif if (write_object) ret = write_sha1_file(buf, size, type, sha1); else and that would take care of a lot of things (yeah, I'd not do it that way in practice, but really doesn't look that nasty - it's actually much nastier to have to look up the text/binary type in the first place). Something similar looks to be true in diff generation. The core "compare two SHA1's at a time" doesn't need any changes, but the code that actually reads in the temporary file from disk obviously does. But even that is just _one_ point, afaik - diff_populate_filespec()": @@ -1362,6 +1362,10 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only) if (fd < 0) goto err_empty; s->data = xmmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0); +#ifndef __UNIX__ + if (text) + convert_crlf_to_lf(&s->data, &s->size); +#endif close(fd); s->should_munmap = 1; } (and again, that's not real code, it would also need to change the "should_munmap" flag to indicate the state of the _new_ "data" thing. Linus -
you could do something like this and it would deal with the srlf/lf problem, but if you instead put in the conversion hooks like Ted suggested then you can actually gain a LOT more. his example of openoffice documents that are gziped xml files is a very good one. if the 'conversion' is to gunzip on checkin and gzip on checkout then the core git logic will work on the nice diffable xml instead of the compressed binary blob. if this is extensable to arbatrary helper functions to do the conversions I'll bet that there are many other cases that can use this. I think the big questions needs to be, is this helper app a filter, or can it be passed a filename as the destination (which would let it do things like set permissions on the files it creates), or should it be both? David Lang -
Hi, In that case, a simple pre-commit hook would suffice. No, the problem mentioned by Mark was a very real one: you _cannot_ rely on Windows' editors not to fsck up with line endings. The worst case is if the file contains _some_ CRLF and _some _LF_. Almost always I had the problem that it now converted _all_ LFs to CRLFs. Even those which already were converted. So, if we are to support text mode, it is not one-way. If we do one-way, we really do _not_ support text mode, but pre-commit conversion to LF style text. And in this case, core git does not need _any_ change. Ciao, Dscho -
Well I disagree in two counts. - I do not see how you propose to solve some CRLF and some LF case with both-ways conversion. - Pre-commit hook would not be sufficient. In a edit, diff, test and then commit cycle, diff and test step needs to look at whatever the editor left on the filesystem, so the changes to populate-filespec is needed to make diff part work. -
Hi, Yes, you are right. However, since this is all post-1.5.0 (right? Right?) why not go with more of Ted's proposal, and make this whole mess also usable for other things than just crlf issues? And I _really_ think that you do not help Windows people by doing this one-way thing. Ciao, Dscho -
Whatever is done, it needs to be robust to the notion that people will fail to set the correct file type somewhere. Current cvsnt is fairly good at autodetecting and setting text vs binary file type, and enforces this across all platforms, so things don't go awry too often. It is in my experience more reliable than subversion, which basically relies upon file extensions mapping to mime types to identify content. All of which is a very much too low standard of accuracy for a version control system: I lost many files per year due to the above nonsense, so I worry about trying to create a very general transform solution and not making it really, really failsafe. Having projects define individual globbing patterns is good, double checking the content for sanity is an absolute must, but I don't think that is enough. I suspect the solution should include round-trip conversion when creating blobs to assure that the input can be exactly reconstructed by the inverse transformation (and therefore possibly rejecting input with mixed line endings). A similar check could be applied on checkout. Perhaps I'm too paranoid, but I've been burnt way too many times by text/binary mode stuff to let this part be trivialized. Maybe it only gets enabled by core.ImReallyParanoid, but I want that option. Mark -
Hi, Be aware that what you proposed costs many CPU cycles. I am totally opposed to enabling that option by default on all platforms. I am okay with .gitattributes (but I would call it .gitfiletypes), but I am _not_ okay with git being _too much_ fscked up by Windows. Microsoft has done enough harm already. Ciao, Dscho -
Indeed; this type of checking should only occur if there is a filter applied to a file. Most files in most projects would hopefully just be considered to be byte streams to Git, like they are today, and thus not incur any additional overhead, beyond matching their type to determine they are in fact just a byte stream. The type could be cached in the index; or at least a single bit which says "I'm just a byte stream, thanks" so that the matching only needs to occur during an initial read-tree. -- Shawn. -
for the limited case of line endings it may be reasonable to define the internal git format to be lf, and if you are running on a platform that uses this nativly no transition is needed one possible way to make this be a general feture is to have the helper script have a --needed flag that tells git if it would do anything on the current platform or not. this way you don't need to run it (and sanity check it) if it's not needed. David Lang -
I would assume that none of this crlf stuff exists at all on Linux / Unix / Posix, so if done right has zero impact outside of the Windows nuthouse. Inside that, folks are already so used to incredible slowness in file I/O that I'm not sure the round tripping I suggest as a check would be very noticeable, but in any case I fully agree it should be optional even there. However, if git could support something that never screws up, absolutely guaranteeing data integrity in the presence of these transforms, that would be a first in this arena and I believe a significant selling point. Mark -
There is obviously much sentiment that this should _not_ be the default (and I agree). But if arbitrary filters are possible, then you can theoretically write an 'autocrlf' filter which will try to do the right thing, and you could set it for some or all files: echo '*: autocrlf' >.gitattributes but it would be off by default. If we implement this, everyone has to "pay" for .gitattributes (even if you don't use it, we have to look it up to make sure you're not using it!), but nobody has to pay for any filters they don't use. -Peff -
the expectation is that the some-of-each situation is unlikly to happen if you convert all the time. and if you do end up with a mixed ending file, the next time you check it in from a windows box it should clean it up. David Lang -
In my work flow, I am using a pre-commit script that (among other things) rewrites all text files to have \n endings. This is a one-way conversion, and does work well for the set of tools I am using. The converters I use I wrote years ago, and are smart enough to deal with mixtures of \n, \r\n, and \r line endings in one file, transforming all into one unified form. d2u / u2d were not that robust when I last tried them (years ago), but this is an absolute necessity. However, I don't think the one-way conversion is acceptable across the board. While the only Windows editor I am aware of that doesn't grok \n is Notepad (the moral equivalent of edlin), I suspect that undo reliance upon this will still lead to grief. If nothing else, someone, somewhere will find that their beloved crlf's are missing and will complain. Loudly. And in the lore, git will become known for being "wierd." So, I suspect a checkout script is necessary. Mark -
Heh, a lofty goal. And I am glad to see that a thread full of constructive suggestion is already going on. So now I do not have to fear starting a flamewar; I can safely I think there is a fundamental misconception in the above. I do not know about others, but to me personally, I do not see any "seeking to supplant", nor "competition". It's not like I or people who raised git into the current shape are begging to windows users to consider using git and bending backwards to please them. You should hone your diplomacy. Current git may or may not match what they need, and if it does not match what they need, making it match what they need is primarily the responsibility of them. If Windows users find something in git that is interesting and useful, but if they find something else lacking in it to be truly useful for them, they can submit patches, or if they cannot implement the changes themselves but only have wishlist items, then _they_ can do the begging. People in git community are certainly friendly and helpful bunch, and some (including me) are unfortunate enough that sometimes they have to touch Windows, so some degree of need is felt to support Windows better even within the community, but it has never been high priority. Making it higher priority by bringing in better ideas and starting the fire must come from I do not think git project needs to do any such thing. The project evolves reflecting the needs of its users, and the design is not decided upfront without doing any feasibility study. I would certainly not say our position is (1), IOW, I would not say we will rule out Windows support. If it can be reasonably done without harming the code, why not? Depending on how cleanly a change Windows users want is done without negatively affecting the existing users, it may or may not be judged acceptable. We will know only when we see at least the design and preferably the code. I feel no need to This is not just you, and fortunately it does not ...
Junio C Hamano wrote: > Mark Levedahl <mlevedahl@verizon.net> writes: > >> I am NOT intending to start a flamewar O:-) , so please don't turn >> this into one. > > Heh, a lofty goal. And I am glad to see that a thread full of > constructive suggestion is already going on. > > So now I do not have to fear starting a flamewar; I can safely > vent. Junio, I meant absolutely no offense in anything I wrote, and sincerely apologize if any was taken. My past experiences caused me to be skeptical that a significant change to accommodate a very bad design of Windows would be accepted here. Happily, that skepticism was misplaced. I am much heartened by the responses, and am optimistic a good solution will be found that is acceptable to all. It is very clear that the group is open and supportive of working through the issues to help this, and I intend to contribute to that solution. (If nothing else, I would like to be known for something besides some modest ability to hack around Tk bugs). So, I trust the flamethrowers can remain buried. Mark -
None taken, although I admit that I was somewhat annoyed, having to write the first part of my response. -
I am strongly object this statement. I develop one project under Windows and use Cygwin git for this. Yes, I have a problem with git's thinking line ending is a \n but most of troubles are diff and rebase. In general git works well with \r\n line endings. When I have file that was converted from dos to unix format (or from unix to dos) git genereta big diff. But anyway, c++ compiler works well with both formats and in this case I simply convert file to dos format and git shows again nice diff. If unix format was commited to git I simply change the format and commit that file again. The only trouble is the rebase, it does not like \r\n ending and othen produce unexpected merge conflict. But I don't use rebse to othen to realy investigate and try to solve the problem. -
Hi, Well, if everybody thinks like you, maybe we do not have to change anything for Windows after all? Ciao, Dscho -
If you are tring to build history that looks good - you are right this is a I still wish to have working rebase so if git will hanle somehow \r\n it would be nice. But please do not produce the same behavior as cvs does: under cygwin it still use \n ! By the way, most windows programmers I work with says 'git is cool but is there gui like tortoise or wincvs ?' :-) -
Hi, You really should teach format-patch to output \n patches, and keep all Some time ago, I started playing with a shell extension. Now that MinGW git is almost there, I might clean it up... Would you be interested in working on it, or is this just wishtalk? Ciao, Dscho -
No no no. It's going to be _horrible_ if people start interesting projects in Windows, and there are files in a git repository that are encoded with CRLF. I'd much rather just get this right, and that means "no hooks". If people start using commit hooks etc, that will just mean that they won't use them for all-windows environments (why use it? Everybody hass CRLF, and everybody _wants_ CRLF), or it will just be relatively expensive to have a complex hook anyway. So I think we should plan on something like .gitattributes or similar, so that we _can_ handle mixed environments well, without any real setup or any real costs. The costs really shouldn't be too high - we tend to avoid doing any expensive working tree changes *anyway*. For example, even "git checkout" has a huge optimization to avoid rewriting files that are already ok, so doing things like switching whole branches usually wouldn't even need any conversion for most files - even on platforms like Windows that need the conversion in the first place. So considering that it looks _trivial_ for git-update-index, fairly easy for diff generation, and I doubt "git checkout" is really likely to be any worse either, this should just be somethign we do. The *ONLY* case where we may not be able to do things automatically is actually a much more subtle one: "git cat-file". If we just get a SHA1, we don't know what the path to look it up was like, and thus we can never know whether it's a binary or a text object. With "-p" we can trivially guess, of course, but "git cat-file blob" simply must not do that! But that really doesn't sound like a big problem to me ;) Linus -
Here's a patch that I think we can merge right now. There may be other places that need this, but this at least points out the three places that read/write working tree files for git update-index, checkout and diff respectively. That should cover a lot of it. Some day we can actually implement it. In the meantime, this points out a place for people to start. We *can* even start with a really simple "we do CRLF conversion automatically, regardless of filename" kind of approach, that just look at the data (all three cases have the _full_ file data already in memory) and says "ok, this is text, so let's convert to/from DOS format directly". THAT somebody can write in ten minutes, and it would already make git much nicer on a DOS/Windows platform, I suspect. And it would be totally zero-cost if you just make it a config option (but please make it dynamic with the _default_ just being 0/1 depending on whether it's UNIX/Windows, just so that UNIX people can _test_ it easily). Linus -
No. HERE's the trivial stupid patch that just marks the core places.
Linus
---
diff --git a/diff.c b/diff.c
index aaab309..13b9b6c 100644
--- a/diff.c
+++ b/diff.c
@@ -1364,6 +1364,7 @@ int diff_populate_filespec(struct diff_filespec *s, int size_only)
s->data = xmmap(NULL, s->size, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);
s->should_munmap = 1;
+ /* FIXME! CRLF -> LF conversion goes here, based on "s->path" */
}
else {
char type[20];
diff --git a/entry.c b/entry.c
index 0ebf0f0..c2641dd 100644
--- a/entry.c
+++ b/entry.c
@@ -89,6 +89,7 @@ static int write_entry(struct cache_entry *ce, char *path, struct checkout *stat
return error("git-checkout-index: unable to create file %s (%s)",
path, strerror(errno));
}
+ /* FIXME: LF -> CRLF conversion goes here, based on "ce->name" */
wrote = write_in_full(fd, new, size);
close(fd);
free(new);
diff --git a/sha1_file.c b/sha1_file.c
index 0d4bf80..8ad7fad 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2091,6 +2091,7 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object, con
if (!type)
type = blob_type;
+ /* FIXME: CRLF -> LF conversion here for blobs! We'll need the path! */
if (write_object)
ret = write_sha1_file(buf, size, type, sha1);
else
-
Thanks, applied. I think git-apply has separate codepaths for both reading and writing; I won't look into them before 1.5.0 but people are welcome to help advancing the cause before I get to it ;-). -
Actually, I did it myself. This is a "lazy man's auto-CRLF", and it really is pretty simple. It currently does NOT know about file attributes, so it does its conversion purely based on content. Maybe that is more in the "git philosophy" anyway, since content is king, but I think we should try to do the file attributes to turn it off on demand. Anyway, BY DEFAULT it is off regardless, because it requires a [core] AutoCRLF = true in your config file to be enabled. We could make that the default for Windows, of course, the same way we do some other things (filemode etc). But you can actually enable it on UNIX, and it will cause: - "git update-index" will write blobs without CRLF - "git diff" will diff working tree files without CRLF - "git checkout" will write files to the working tree _with_ CRLF and things work fine. Funnily, it actually shows an odd file in git itself: git clone -n git test-crlf cd test-crlf git config core.autocrlf true git checkout git diff shows a diff for "Documentation/docbook-xsl.css". Why? Because we have actually checked in that file *with* CRLF! So when "core.autocrlf" is true, we'll always generate a *different* hash for it in the index, because the index hash will be for the content _without_ CRLF. Is this complete? I dunno. It seems to work for me. It doesn't use the filename at all right now, and that's probably a deficiency (we could certainly make the "is_binary()" heuristics also take standard filename heuristics into account). I don't pass in the filename at all for the "index_fd()" case (git-update-index), so that would need to be passed around, but this actually works fine. NOTE NOTE NOTE! The "is_binary()" heuristics are totally made-up by yours truly. I will not guarantee that they work at all reasonable. Caveat emptor. But it _is_ simple, and it _is_ safe, since it's all off by default. The patch is pretty simple - the biggest part is the new "convert.c" file, but even ...
This whole auto CRLF things seems to deal with DOS issues that I personally have not encountered since looong time ago. Granted notepad in Windows does not understand UNIX files but that a bug in notepad and everyone knows that wordpad can be used. I wonder what we are really trying to address here. Or in other words could the original poster maybe tell what Windows IDE's that does not handle UNIX files properly? core git today should not care about CRLF as opposed to LF end-of-line as long as the end-of-line is consistent - correct? So defaulting to autoCRLF in Windows/DOS environments was maybe sane 10 years ago but today that seems to be the wrong thing to do. For certain project the option could be useful if the tool-set in the project *requires* CRLF, but if the toolset like all modern toolset supports both CRLF and LF then git better avoid changing end-of-line marker. Sam -
Maybe you didn't share a work environment with Windows users since Windows IDE's can _create_files. Those files will be CRLF infected. Also some of them read UNIX files just fine but they will use CRLF to Rather git better enforce consistency otherwise it'll be only a mix of possible combination as soon as Windows and UNIX users work on the same project. Nicolas -
Maybe you stopped using DOS a loong time ago ;) It's definitely an issue. Yes, all windows programs basically *understand* files that have just LF. But almost all of them will *write* files with CRLF. (Which means that I suspect I made the default for "auto_crlf" be wrong in my patch: I probably should not default to checking out with CRLF, but checking out with just LF, and only do the CRLF->LF conversion on input). Anybody who has ever worked with _any_ Windows people have long since learnt that they always end up having to convert CRLF to just LF when they get files. Even _I_ know it, and I seldom have to work with people who use Windows ;) So it's a good idea to try to make sure that Windows users don't corrupt files by adding CRLF where there is no need for them into a git archive. We hope to convert those people to a real OS some day ("here's a nickel, boy"), and to make it easier for them to do it, making sure that their projects in -git are already in a sane format is probably a good idea. Linus -
So the issue with git supporting CRLF -> LF is to make interoperability between UNIX* programs and Windows programs which is anohter domain. My main objective is the proposal to make a conversion default when many users do not need it. For the UNIX* compatibility thing having conversion at lowest Expect that it seems a few br0ken programs yet does not support LF as end-of-line marker - so .gitattriutes make take special care here. Sam -
Yes, but I also think that even without .gitattributes, you just want to have a default for what "text" actually means, and it's entirely possible that the default should be: "check out with just LF, and on check-in turn CRLF into LF". But exactly because _some_ programs might want to always see CRLF on input too, it should be overridable. Or maybe the default should be "turn into CRLF", and there should just be an option to make it check out as LF-only. Regardless, I think that is independent of ".gitattributes". The _attribute_ should be "text", but what it then means in practice is a separate flag. And yes, we *could* have a per-file attribute ("text,crlf-checkout") which could be used to say "I want to always check out as crlf regardless of any other policy") and the same for lf-only, but I seriously doubt that anybody really needs that kind of knob-tweaking. At some point it's just fine to say "you're crazy". Linus -
The definition of what is "text" and what action to take upon check-in / check-out of text is two sepearate things. I could see it as beneficial as a per-project or even as an overall git-policy to say "checkin-as-LF" - "checkout-as-LF" to overcome Which is where I see .gitattributes come into play. -> A rule that says files with extension .prj and of type "text" shall not see any conversion. In this way almost all "text" over time get a proper format and the remaining brain-dead tools that continue to save in CRLF format will not destroy the sane LF format. If anything gets defualt I would vote for LF. But overrideable. My editor-of-choice does eol auto-sense. If I recall correct it scans the first 200 lines and counts number of CR,LF,CRLF and based on this judge the actual eol character used. But not all editors are that sensible :-( Sam -
I've actually run into grief on this subject with perl scripts within the last year (files from windows systems with crlf not working cleanly on a linux system with just lf) this is real, not just historic David Lang -
MS VC has text file for project file but don't like \n line endings, only \r\n. -
It might be safe for some definition of safe, but it is very Asian unfriendly. I'd probably suggest replacing it with what GNU diff uses, which we stolen and implemented in diff.c::mmfile_is_binary(). -
Well, the thing is, mmfile_is_binary() doesn't really have a big downside if it's wrong one way or the other. In contrast CR->CRLF conversion, if wrong, actually corrupts binary files. So I felt it was better to be really safe than sorry. It's *much* better to miss some CRLF translation than to do too much of it. That said, I'm sure it could be improved a lot. In particular, characters in the range 0x00 - 0x1f are clearly "more binary" than the 0x7f+ range, with the obvious exceptions (tab, cr, lf). 0x00 - which is the only one mmfile_is_binart() uses - is arguably the "most binary" one, of course, but it might be interesting to give different weights to the whole range.. In particular, especially for small files, the fact that there is no 0x00 byte in no way indicates that it's not "binary". This whole issue is obviously one reason I'd like to involve the filename itself, and make it use a ".gitattributes" file - exactly because that allows you to be much more aggressive and more precise. (0x00 may be one of the more _common_ characters in many binary files, which makes it a good character to search for too, so I don't really have any hugely strong opinions here. After all, the whole heuristic is off by default anyway, so it's "really safe" ;^) Linus -
Hi, Last time I checked, the text files never had lines longer than 200 characters (I chose this intentionally large). So, it might be a good heuristic to check the maximal line length, and refuse to believe that it's text once a certain (configurable) threshold is reached. Ciao, Dscho -
Unfortunately, on my program we have folks using text files with single lines over 60,000 characters long, these are data files. Think for example of a comma or tab separated data file saved from a spreadsheet. In this case, the files are pure ascii. So, the line length could be something else to take into account, but is not decisive by itself. To recap, we have the following various suggestions to determine textness: 1) ratio of ascii to non-ascii characters, possibly weighting some chars more than others 2) line length 3) existence of a null (\0) 4) file name globbing 5) roundtrip ( lf(crlf(file) ) == file I don't think any one suggestion is completely adequate for all uses, all need to be available, somehow configurable. This suggests to me a core.AutoCRLFstrategy variable that is a comma separated list of methods to use (set to a reasonable default of course that does not cause runtime headaches on Unix): a file would be deemed binary unless all listed methods declare the file as text (with an empty list disabling AutoCRLF detection). Mark -
Actually, my patch already had one that you didn't mention: 6) CR never shows up alone. So the patch I sent out basicallyhad the following rules: - no more than ~10% of all characters being other than regular printable ASCII (where any control character except for newline/cr/tab was deemed nonprintable) - any "lonely" CR automatically means it's binary, and I would refuse to convert that to a LF (the test in the code is that CRLF count must match CR count) but the "roundtrip" rule is much too strict (it's actually perfectly possible for an editor to add CRLF characters only to new _lines_, leaving old lines with just LF - or the other way around. In fact, the editor I use under Linux does exactly that in reverse - if I add new lines, it will add those without CR, but will leave old lines with CRLF alone). I think that to help asian languages (or strange text-files in utf8 or Latin1 too, for that matter: test-files with _just_ special characters), I should probably make the rule be that only the 0-31 range is special. Linus -
I would agree. 0-31 except HT, CR, LF and ESC would be a good
idea; that would not harm text in UTF-8, EUC based various
locales nor ISO 2022.
Patch is relative to 'pu'.
-- >8 --
diff --git a/convert.c b/convert.c
index ebcf717..b6b7c66 100644
--- a/convert.c
+++ b/convert.c
@@ -13,7 +13,7 @@ struct text_stat {
unsigned cr, lf, crlf;
/* These are just approximations! */
- unsigned printable, nonprintable, nul;
+ unsigned printable, nonprintable;
};
static void gather_stats(const char *buf, unsigned long size, struct text_stat *stats)
@@ -34,13 +34,11 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
stats->lf++;
continue;
}
- if (c == '\t' || (c >= 32 && c < 127)) {
- stats->printable++;
+ if ((c < 32) && (c != '\t' && c != '\033')) {
+ stats->nonprintable++;
continue;
}
- if (!c)
- stats->nul++;
- stats->nonprintable++;
+ stats->printable++;
}
}
@@ -50,7 +48,7 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
static int is_binary(unsigned long size, struct text_stat *stats)
{
- if (stats->nul)
+ if (stats->nonprintable)
return 1;
/*
* Other heuristics? Average line length might be relevant,
-
Yeah, I think we can ignore them.. You could possibly add 127 to the list too (it's ascii DEL, I don't know But this is too harsh. It's quite common to have the occasional FF character. Some things really do use it for page breaks. So saying that *any* nonprintable character is bad is not a good idea. Same goes for BS (some programs use it to show bold and underlined text: man-pages, for example). Linus -
Ok. How about adding BS and FF to the Ok set, and checking if
bad ones are less than 1% of the good ones?
diff --git a/convert.c b/convert.c
index b6b7c66..b0c7641 100644
--- a/convert.c
+++ b/convert.c
@@ -34,11 +34,22 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
stats->lf++;
continue;
}
- if ((c < 32) && (c != '\t' && c != '\033')) {
+ if (c == 127)
+ /* DEL */
stats->nonprintable++;
- continue;
+ else if (c < 32) {
+ switch (c) {
+ /* BS, HT, ESC and FF */
+ case '\b': case '\t': case '\033': case '\014':
+ stats->printable++;
+ break;
+ default:
+ stats->nonprintable++;
+ }
+
}
- stats->printable++;
+ else
+ stats->printable++;
}
}
@@ -48,7 +59,7 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
static int is_binary(unsigned long size, struct text_stat *stats)
{
- if (stats->nonprintable)
+ if ((stats->printable >> 7) < stats->nonprintable)
return 1;
/*
* Other heuristics? Average line length might be relevant,
-
I think that looks fine. Linus -
Hi, This sounds regretfully complex. Somebody (you?) mentioned that cvsnt does a kick-ass job here. Does cvsnt need strategies? I don't think so. Neither do we. Someone who cares enough should just rip^H^H^Hlook at cvsnt's text detection. Ciao, Dscho -
Well, one thing to keep in mind is that for source code in particular, this really very seldom is an issue. So you can do a really *bad* job in theory, and in practice it really works very very well. Very few people keep binary blobs in any SCM archive _anyway_, partly because they've always been told that it's unsafe (and with a lot of SCM's it is), but even more because binary blobs are almost always generated by some build method, so normally you'd never version them in the first place, or versioning isn't all that helpful. And most binary blobs are so *obviously* binary that even the stupidest algorithm on earth will get it right. The only hard cases actually tend to be really tiny files, or literally test-sequences. Tiny files are hard because: - they (by being tiny) have so few characters that they can easily lack a "fingerprint" character (eg a NUL character or similar). - tiny files are a lot more likely than bigger files to have strange statistics that throw some more "sophisticated" rule off the scent. Something like a "10% rule" tends to work fine if you have a big text, and ten percent is still a reasonable number to average things out over, but what if you only had ten characters to begin with? The good news is that tiny files can usually be considered text, since you'd seldom use a binary format for something really small anyway. So I suspect that IN PRACTICE, especially if you come as a CVS replacement (where binary files are just damn hard to get right even under the best of circumstances!), you can do just about anything, including just saying "everything is text", and you'd be fine. It's entirely possible that that is exactly what CVSNT does ;) Linus -
I agree that is complex, I started thinking of PAM when I wrote that, leading to, "this aint gonna work." But in the modern day let's all feel good spirit of "there are no stupid ideas, just some are better" I threw it out anyway. As to cvsnt, my actual feeling is I'd like to kick it in the ass, it has destroyed too many files for me over the years, binary and text, so I don't think its strategies are very good. That is why I'm kicking these ideas around, if I thought I knew the "right" way I would have written it already. Mark -
That may be why an excellent piece of software, TortoiseCVS, doesn't trust cvs or cvsnt to do the job. Here is how they do the binary detection (and some more): http://tortoisecvs.cvs.sourceforge.net/tortoisecvs/TortoiseCVS/src/CVSGlue/CVSStatus.c... -- robin -
Well, it does seem to boil down to what Junio already got to: - 0-31 and 127 are never in text, except for BEL, BS, HT, LF, FF, CR and ESC. - 128-255 can all be in either iso-8859 or extended ascii (or they explicitly add NEL but not 128+27 to "normal ASCII", which is strange) So they've effectively added BEL and ESC to the listof characters that Junio has now. But they also make it an absolute error to have anything else (no "1% rule"). But they also do the filename tests, and I think that's more important in many ways. Linus -
Especially ESC used to be common in DOS/Windows and quite a few hang around in Can this 1%-rule be motivated from real cases, rather that hypotetical ones? It makes A unixy tool like git should maybe use magic too :). Btw the filename (like .gitignore or similar) test in practice would give us the binary flag. Just list a filename instead of a pattern. -- robin -
No, some broken editor programs and people use "flowing text" files, where a newline is actually a _paragraph_ end. You have lines in the hundreds (and thousands) of characters, and the program will just flow the text for you. Ugh. Horrible, I know. Linus -
Hi, Good point. Ciao, Dscho -
Wow ! Thanks. I just tried this patch and it works! From now I can use git-cvsimport under Linux and then clone it to cygwin and work there with full history. Nice, very nice. In my case text file detection work well as far most of our files are .cpp and .h -
Btw, it didn't do any commit message conversion etc, so you'll still always see commit messages with LF-only, and if you _create_ commits, you Yeah, considering that it worked in my testing for "git" itself, I'm not surprised. Source code tends to look the same.. Linus -
I think stripspace removes CR so we should be Ok. -
Hi, Why the haste all of a sudden? Your patch is easily applyable for anyone who wants to work on text/binary or arbitrary file types. No need to rush a developers-only patch into git.git. Ciao, Dscho -
git-cat-file, and its counter part git-hash-object, are fairly low level plumbing. Anyone using them should be aware of the issue and apply the needed conversion. And actually, since we're going to have the conversion routines in the core, we'd only need to add a --crlf argument to both of them to optionally perform the conversion since the user of those commands is more likely to know if the conversion is needed. Nicolas -
Hi, No hooks means something like cvsnt does, and that means no .gitattributes either. (BTW I really hate .gitattributes, as it does not at all say what this is about; it's about file _conversions_, not attributes). CVSNT analyzes the files, and guesses if they are text, and only then activates the text mode. I am strongly opposed to including something like that. (It was already proposed, and your "no hooks" suggests the same.) However, I am slightly positive about the .gitfiletypes approach, _iff_ we think about more than just text/binary from the start. If we do it right, it will buy us more. Ciao, Dscho -
We might start with only binary/text attributes, but we may add more later, e.g. chmod=o-rwx. I do not see much differnece between attributes vs filetypes. -
No, it *is* about attributes. In order to know how to convert, you need to know the attributes of the file. So it's not about conversion: we would ALWAYS do conversion. It's about the fact that in order to do the conversion, we need to know what the attributes of the file is - is it text, or what. And the equal point is that there are _other_ attributes that git might care about. The "merge strategy" attribute, for example. Or "owner" attributes for files etc. Linus -
Hi, Yes, you're right. Colour me converted (pun intended). Ciao, Dscho -
