Re: [PATCH] Use FIX_UTF8_MAC to enable conversion from UTF8-MAC to UTF8

Previous thread: [PATCH] parse_object_buffer: don't ignore errors from the object specific parsing functions by Martin Koegler on Monday, January 21, 2008 - 12:21 am. (1 message)

Next thread: Re: by MichaelTiloDressel@t-online.de on Monday, January 21, 2008 - 3:49 am. (3 messages)
From: Mark Junker
Date: Monday, January 21, 2008 - 2:12 am

Use FIX_UTF8_MAC to enable conversion from UTF8-MAC to UTF8 for readdir 
and get_pathspec.

I had to change get_pathspec too because otherwise git-add wouldn't work 
anymore because it uses the output of get_pathspec as strings to compare 
with the output of readdir.

I'm quite unsure because this is my first patch for the git project and 
I have several questions:

1. Is FIX_UTF8_MAC the right name for this "feature"?
2. Do I have to introduce a configuration option for this "feature"?

Signed-off-by: Mark Junker <mjscod@web.de>
---
  Makefile          |    5 +++++
  compat/readdir.c  |   26 ++++++++++++++++++++++++++
  git-compat-util.h |    5 +++++
  setup.c           |   12 ++++++++++++
  4 files changed, 48 insertions(+), 0 deletions(-)

diff --git a/Makefile b/Makefile
index 5aac0c0..e55914e 100644
--- a/Makefile
+++ b/Makefile
@@ -417,6 +417,7 @@ ifeq ($(uname_S),Darwin)
  	endif
  	NO_STRLCPY = YesPlease
  	NO_MEMMEM = YesPlease
+	FIX_UTF8_MAC = YesPlease
  endif
  ifeq ($(uname_S),SunOS)
  	NEEDS_SOCKET = YesPlease
@@ -616,6 +617,10 @@ ifdef NO_STRLCPY
  	COMPAT_CFLAGS += -DNO_STRLCPY
  	COMPAT_OBJS += compat/strlcpy.o
  endif
+ifdef FIX_UTF8_MAC
+	COMPAT_CFLAGS += -DFIX_UTF8_MAC
+	COMPAT_OBJS += compat/readdir.o
+endif
  ifdef NO_STRTOUMAX
  	COMPAT_CFLAGS += -DNO_STRTOUMAX
  	COMPAT_OBJS += compat/strtoumax.o
diff --git a/compat/readdir.c b/compat/readdir.c
new file mode 100644
index 0000000..045cfef
--- /dev/null
+++ b/compat/readdir.c
@@ -0,0 +1,26 @@
+#include "../git-compat-util.h"
+#include "../utf8.h"
+
+#undef readdir
+
+static struct dirent temp;
+
+struct dirent *gitreaddir(DIR *dirp)
+{
+	size_t utf8_len;
+	char *utf8;
+	struct dirent *result;
+	result = readdir(dirp);
+	if (result != NULL) {
+		memcpy(&temp, result, sizeof(struct dirent));
+		utf8 = reencode_string(temp.d_name, "UTF8", "UTF8-MAC");
+		if (utf8 != NULL) {
+			utf8_len = strlen(utf8);
+			temp.d_namlen = (u_int8_t) utf8_len;
+			memcpy(temp.d_name, utf8, ...
From: Mark Junker
Date: Monday, January 21, 2008 - 2:45 am

Sorry, the patch was broken (TABS were converted to spaces).

Signed-off-by: Mark Junker <mjscod@web.de>
---
  Makefile          |    5 +++++
  compat/readdir.c  |   26 ++++++++++++++++++++++++++
  git-compat-util.h |    5 +++++
  setup.c           |   12 ++++++++++++
  4 files changed, 48 insertions(+), 0 deletions(-)

diff --git a/Makefile b/Makefile
index 5aac0c0..e55914e 100644
--- a/Makefile
+++ b/Makefile
@@ -417,6 +417,7 @@ ifeq ($(uname_S),Darwin)
  	endif
  	NO_STRLCPY = YesPlease
  	NO_MEMMEM = YesPlease
+	FIX_UTF8_MAC = YesPlease
  endif
  ifeq ($(uname_S),SunOS)
  	NEEDS_SOCKET = YesPlease
@@ -616,6 +617,10 @@ ifdef NO_STRLCPY
  	COMPAT_CFLAGS += -DNO_STRLCPY
  	COMPAT_OBJS += compat/strlcpy.o
  endif
+ifdef FIX_UTF8_MAC
+	COMPAT_CFLAGS += -DFIX_UTF8_MAC
+	COMPAT_OBJS += compat/readdir.o
+endif
  ifdef NO_STRTOUMAX
  	COMPAT_CFLAGS += -DNO_STRTOUMAX
  	COMPAT_OBJS += compat/strtoumax.o
diff --git a/compat/readdir.c b/compat/readdir.c
new file mode 100644
index 0000000..045cfef
--- /dev/null
+++ b/compat/readdir.c
@@ -0,0 +1,26 @@
+#include "../git-compat-util.h"
+#include "../utf8.h"
+
+#undef readdir
+
+static struct dirent temp;
+
+struct dirent *gitreaddir(DIR *dirp)
+{
+	size_t utf8_len;
+	char *utf8;
+	struct dirent *result;
+	result = readdir(dirp);
+	if (result != NULL) {
+		memcpy(&temp, result, sizeof(struct dirent));
+		utf8 = reencode_string(temp.d_name, "UTF8", "UTF8-MAC");
+		if (utf8 != NULL) {
+			utf8_len = strlen(utf8);
+			temp.d_namlen = (u_int8_t) utf8_len;
+			memcpy(temp.d_name, utf8, utf8_len + 1);
+			free(utf8);
+			result = &temp;
+		}
+	}
+	return result;
+}
diff --git a/git-compat-util.h b/git-compat-util.h
index b6ef544..cd0233d 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -202,6 +202,11 @@ void *gitmemmem(const void *haystack, size_t 
haystacklen,
                  const void *needle, size_t needlelen);
  #endif

+#ifdef FIX_UTF8_MAC
+#define readdir gitreaddir
+struct dirent ...
From: Mark Junker
Date: Monday, January 21, 2008 - 2:50 am

ARGH! Sorry, I'll try it again later - when I fixed this stupid implicit 
conversion of TAB to SPC ...

Regards,
Mark

-

From: Mark Junker
Date: Monday, January 21, 2008 - 2:55 am

Hi,

here's the patch again - sent as an attachment. Sorry for any inconvenience.

Regards,
Mark
From: Junio C Hamano
Date: Monday, January 21, 2008 - 3:15 am

I do not know how Macintosh libc implements "struc dirent", but
this approach does not work in general.  For example, on Linux
boxes with glibc, "struct dirent" is defined like this (pardon
the funny indentation --- that is from the original):

        struct dirent
          {
        #ifndef __USE_FILE_OFFSET64
            __ino_t d_ino;
            __off_t d_off;
        #else
            __ino64_t d_ino;
            __off64_t d_off;
        #endif
            unsigned short int d_reclen;
            unsigned char d_type;
            char d_name[256];		/* We must not include limits.h! */
          };

yet you can obtain a path component longer than 256 bytes.
Apparently the library allocates longer d_name[] field than what
is shown to the user.

-

From: Mark Junker
Date: Monday, January 21, 2008 - 3:36 am

IMHO there is no need that this approach works in general because this 
is a fix for MacOSX systems only. I also use d_namlen which might not be 

This is not a problem either because on MacOSX we get decomposed UTF8 
and we always convert to composed UTF8. This means that the string 
returned from reencode_string will always be smaller than the original 
filename that had to be reencoded.

Regards,
Mark

-

From: Junio C Hamano
Date: Monday, January 21, 2008 - 4:04 am

It is not quite enough that this works Ok on MacOS, if you made
FIX_UTF8_MAC definable in the Makefile.  After all some friendly
and helpful Linux folks might want to enable it with their build
trying to help debugging, right?

In the short term, as long as it safely runs without overrunning
the buffer on MacOS, then that is fine, even though we will need
some protection to prevent this code from getting compiled and
used on Linux with glibc, which does have the issue.

I was specifically talking about this "static" thing.

+static struct dirent temp;


+struct dirent *gitreaddir(DIR *dirp)
+{
+	size_t utf8_len;
+	char *utf8;
+	struct dirent *result;
+	result = readdir(dirp);
+	if (result != NULL) {
+		memcpy(&temp, result, sizeof(struct dirent));
+		utf8 = reencode_string(temp.d_name, "UTF8", "UTF8-MAC");
+		if (utf8 != NULL) {
+			utf8_len = strlen(utf8);
+			temp.d_namlen = (u_int8_t) utf8_len;
+			memcpy(temp.d_name, utf8, utf8_len + 1);
+			free(utf8);
+			result = &temp;
+		}
+	}
+	return result;
+}

You memcpy() what the library gave you in *result to the
statically allocated "temp".  d_name[] in "temp" comes from the
structure definition in the user visible include file, which
could be much shorter than what the library gave you in *result.
The structure definition I showed in my message you are
responding to illustrates the issue.  If MacOS uses a similar
trick to define d_name[256] and sometimes returns much longer
name in *result, you are truncating the name by copying only the
first part of the structure and first 256 bytes of d_name[]. 

But you have a Mac, I don't, so as long as you have verified
that their header has enough room in statically allocated "temp"
to store longest possible name that can be returned from
readdir(), the code is Ok.  I was just being cautious, as I know
the above code has a problem on one platform.

-

From: Mark Junker
Date: Monday, January 21, 2008 - 4:43 am

Now I understand what you mean. Ok, I'll try to change this and make 
this work on other platforms too.

I didn't know that the readdir function is allowed to return something 
longer for d_name than the specified length.

Regards,
Mark

From: H. Peter Anvin
Date: Monday, January 21, 2008 - 9:08 pm

That's not true!  There are strings which gets longer when a composing 
normalization is applied.  Please see section 3.3 of Unicode Techical 
Report 36:

	http://www.unicode.org/reports/tr36/

 > People assume that NFC always composes, and thus is the same or
 > shorter length than the original source. However, some characters
 > decompose in NFC.

(NFC = Normalization Form Composing.)

U+1D160 MUSICAL SYMBOL EIGHT NOTE is given as an example with a 3x 
expansion factor when encoded in UTF-8 (I don't know what it expands to; 
seems odd to me.)

	-hpa
-

From: Linus Torvalds
Date: Monday, January 21, 2008 - 9:59 pm

Individual components are limited to 255 bytes by most filesystems 
(PATH_MAX is the whole path, not any individual component).

That said, you're right. It's not really a design requirement, and since 
you never get an array of "struct dirent", just a pointer to a single one, 
it would be perfectly normal and natural for "struct dirent" to be 
declared with a unsized d_name[].

It's also quite possible that some implementations might even have 
d_name[] not as an array, but as a pointer to somewhere else (POSIX may 
require it to be an array, I didn't check).

That said, I bet that Mark isn't the only one to have written code like 
that, so I suspect Mark's code probably works in practice pretty much 
everywhere, even if I don't think it's necessarily _required_ to work 
correctly.

I do suspect that if you really want to make this portable, and able to 
handle an expanding d_name[] too, I think you need to make sure you 
allocate a big-enough one. And if you worry about d_name perhaps being a 
pointer, that really does mean that you'd need to convert the 
system-supplied "struct dirent" into a "git_dirent_t" that you can 
control.

That said, I think this patch has a bigger problem, namely just 
fundamentally that

	char *utf8 = reencode_string(entry, "UTF8", "UTF8-MAC");

is just unbelievably slow. That's just not how it should be done.

First off, the common case is that the filename likely has everything in 
plain 7-bit ascii. So rather than re-encoding by default, the first thing 
to do is to just see if it even needs re-encoding. Even if it's as simple 
as saying "does it have any high bits at all", that's going to be a *huge* 
performance win.

So start off with something like

	int is_usascii(const char *p)
	{
		char c;

		do {
			c = *p++;
		} while (c > 0);
		return !c;
	}

and now you can do

	if (is_usascii(entry->d_name))
		return entry;

before you even *look* at re-encoding it (and this basically works for all 
cases - we really ...
From: Linus Torvalds
Date: Tuesday, January 22, 2008 - 12:16 am

Having thought about this some more, I'm starting to suspect that the 
"readdir()" wrapper thing won't work very well.

Yes, it will work on OS X, but for all the wrong reasons. It works there 
just because of the stupid normalization that OS X does both on filename 
input and output, so if we hook into readdir() and munge the name there, 
we'll still be able to use the munged name for lstat() and open().

However, we'll never be able to test it on a sane Unix system, and it 
won't ever be able to handle the case of a filesystem actually being 
Latin1 but git being asked to try to transparently convert it to utf-8 in 
order to work with others.

Because most of those readdir() calls will just be fed back not just to 
the filesystem as lstat() calls later, but also to the recursive directory 
traversal itself, so if we munge the name, we're also going to screw name 
lookup.

Again, as an OSX-only workaround it's probably acceptable, and perhaps 
that's the only thing to look at right now. But it does strike me as a 
design mistake to do it at that level.

It would be conceptually nicer to do it in "add_file_to_index()" instead. 
Ie anything that creates a "struct cache_entry" would do the 
conversion. 

So it would be good if somebody looked at what happens if you do the OSX 
hack in add_file_to_index() instead, and see if it works there..

		Linus
-

From: Junio C Hamano
Date: Tuesday, January 22, 2008 - 12:54 am

Yes, we would need a reverse conversion when going from index to
work tree, including entry.c, in order to be able to emulate
this on filesystems that do not take "equivalent" but different
names on open(), creat() and lstat().


-

From: Robin Rosenberg
Date: Tuesday, January 22, 2008 - 3:34 pm

From: Dmitry Potapov
Date: Tuesday, January 22, 2008 - 5:20 am

Yes, when I proposed the readdir() wrapper, I meant it to be as OS X
specific hack. Just because HFS+ munges names and does that by converting
them in the form that is HFS+ specific, we can safely convert then into
NFC, as we do not lose more information than it is lost already, and
more importantly, AFAIK, everything that a user types on Mac is in NFC,

Yes, but that is a separate issue, which unfortunately is much more
difficult to deal with. Basically, there are two approaches -- either
to wrap all input/output functions, or to find another point where it
is possible to convert names without re-writing too much code in Git.
It seems to me that the first approach may requires wrapping too much
functions, but looking at the code I am not sure that the second will
be much easier. There are many places where a filename in the local
encoding will interact with Git internal encoding used by repo.

If we spoke about Windows only, I would say that the first approach makes
much more sense, because all i/o functions used on Windows are already
wrappers over Unicode functions. So, converting UTF-8 <-> UTF-16 makes
much more sense than UTF-8 <-> some-local-encoding(*) <-> UTF-16.

(*) In fact, two different encodings for the same locale setting -- 

I don't think it is going to work, without changing a lot of code,
because filenames entered by user and those that are returned by
readdir() are different. Also, .gitignore or .gitattributes files will
have filenames in the form that differs from returned by readdir().


Dmitry
-

From: Dmitry Potapov
Date: Tuesday, January 22, 2008 - 4:57 am

Yes, starting with Mac OS X 10.2 there are functions for that.
http://developer.apple.com/qa/qa2001/qa1235.html

Anyway, even if iconv is to be used, I believe it should be possible to
avoid malloc here (I usually allocate 256 on stack and use malloc()/free()
only when I need more than that which in practice never happens!). It is
also avoidable to call iconv_open/iconv_close for each name by putting the
allocated descriptor for character set conversion into a static variable.
Thus leaving iconv() alone, which should not be big overhead provided that
it is done only for non-ASCII names.

Dmitry
-

From: Nicolas Pitre
Date: Tuesday, January 22, 2008 - 7:21 am

You need to use "signed char" here.  On ARM a char is unsigned by 
default.  That's the case on some other systems too.


Nicolas
-

From: Linus Torvalds
Date: Tuesday, January 22, 2008 - 8:58 am

Correct you are. Me bad.

		Linus
-

From: Johannes Schindelin
Date: Monday, January 21, 2008 - 4:24 am

Hi,


I hate three facts about this patch:

- it is too specific to the MacOSX filesystem issues (and better 
  alternatives have _already_ been proposed),

- it is a new feature and not a bug fix, very, _very_ late in the rc 
  cycle,

- it contains questions in the commit message? WTF?  Should it not be 
  marked as PATCH/RFC, possibly without a signoff to make sure that you 
  want to discuss it first?

It's possible I am grumpy because everybody and her dog seems to work on 
her little projects, while I listen to Junio and try to work with/on 
"master" already since a month.

Ciao,
Dscho

-

From: Junio C Hamano
Date: Monday, January 21, 2008 - 4:29 am

Sign-off is about "this is kosher, from licensing point of view"
and nothing else.  Please do not suggest otherwise.

I do not mind discussions during the feature freeze as long as
it stays at "feeler" level.
-

From: Mark Junker
Date: Monday, January 21, 2008 - 4:49 am

I know that there were proposed alternatives but I like to use git on 

It was never meant for inclusion now. I know that this is post-1.5.4 stuff.

Regards,
Mark

-

From: Johannes Schindelin
Date: Monday, January 21, 2008 - 5:09 am

Hi,


In this case, I am going to work on my suggestion myself.

Now.

Ciao,
Dscho

-

From: Johannes Schindelin
Date: Monday, January 21, 2008 - 12:14 pm

Hi,


I retract that.  Although I put some work into it, I agree that it is post 
1.5.4.

Plus, I do not want anybody to think that shouting and being a PITA buys 
him anything.

Ciao,
Dscho

-

Previous thread: [PATCH] parse_object_buffer: don't ignore errors from the object specific parsing functions by Martin Koegler on Monday, January 21, 2008 - 12:21 am. (1 message)

Next thread: Re: by MichaelTiloDressel@t-online.de on Monday, January 21, 2008 - 3:49 am. (3 messages)