Use FIX_UTF8_MAC to enable conversion from UTF8-MAC to UTF8 for readdir
and get_pathspec.
I had to change get_pathspec too because otherwise git-add wouldn't work
anymore because it uses the output of get_pathspec as strings to compare
with the output of readdir.
I'm quite unsure because this is my first patch for the git project and
I have several questions:
1. Is FIX_UTF8_MAC the right name for this "feature"?
2. Do I have to introduce a configuration option for this "feature"?
Signed-off-by: Mark Junker <mjscod@web.de>
---
Makefile | 5 +++++
compat/readdir.c | 26 ++++++++++++++++++++++++++
git-compat-util.h | 5 +++++
setup.c | 12 ++++++++++++
4 files changed, 48 insertions(+), 0 deletions(-)
diff --git a/Makefile b/Makefile
index 5aac0c0..e55914e 100644
--- a/Makefile
+++ b/Makefile
@@ -417,6 +417,7 @@ ifeq ($(uname_S),Darwin)
endif
NO_STRLCPY = YesPlease
NO_MEMMEM = YesPlease
+ FIX_UTF8_MAC = YesPlease
endif
ifeq ($(uname_S),SunOS)
NEEDS_SOCKET = YesPlease
@@ -616,6 +617,10 @@ ifdef NO_STRLCPY
COMPAT_CFLAGS += -DNO_STRLCPY
COMPAT_OBJS += compat/strlcpy.o
endif
+ifdef FIX_UTF8_MAC
+ COMPAT_CFLAGS += -DFIX_UTF8_MAC
+ COMPAT_OBJS += compat/readdir.o
+endif
ifdef NO_STRTOUMAX
COMPAT_CFLAGS += -DNO_STRTOUMAX
COMPAT_OBJS += compat/strtoumax.o
diff --git a/compat/readdir.c b/compat/readdir.c
new file mode 100644
index 0000000..045cfef
--- /dev/null
+++ b/compat/readdir.c
@@ -0,0 +1,26 @@
+#include "../git-compat-util.h"
+#include "../utf8.h"
+
+#undef readdir
+
+static struct dirent temp;
+
+struct dirent *gitreaddir(DIR *dirp)
+{
+ size_t utf8_len;
+ char *utf8;
+ struct dirent *result;
+ result = readdir(dirp);
+ if (result != NULL) {
+ memcpy(&temp, result, sizeof(struct dirent));
+ utf8 = reencode_string(temp.d_name, "UTF8", "UTF8-MAC");
+ if (utf8 != NULL) {
+ utf8_len = strlen(utf8);
+ temp.d_namlen = (u_int8_t) utf8_len;
+ memcpy(temp.d_name, utf8, ...Sorry, the patch was broken (TABS were converted to spaces).
Signed-off-by: Mark Junker <mjscod@web.de>
---
Makefile | 5 +++++
compat/readdir.c | 26 ++++++++++++++++++++++++++
git-compat-util.h | 5 +++++
setup.c | 12 ++++++++++++
4 files changed, 48 insertions(+), 0 deletions(-)
diff --git a/Makefile b/Makefile
index 5aac0c0..e55914e 100644
--- a/Makefile
+++ b/Makefile
@@ -417,6 +417,7 @@ ifeq ($(uname_S),Darwin)
endif
NO_STRLCPY = YesPlease
NO_MEMMEM = YesPlease
+ FIX_UTF8_MAC = YesPlease
endif
ifeq ($(uname_S),SunOS)
NEEDS_SOCKET = YesPlease
@@ -616,6 +617,10 @@ ifdef NO_STRLCPY
COMPAT_CFLAGS += -DNO_STRLCPY
COMPAT_OBJS += compat/strlcpy.o
endif
+ifdef FIX_UTF8_MAC
+ COMPAT_CFLAGS += -DFIX_UTF8_MAC
+ COMPAT_OBJS += compat/readdir.o
+endif
ifdef NO_STRTOUMAX
COMPAT_CFLAGS += -DNO_STRTOUMAX
COMPAT_OBJS += compat/strtoumax.o
diff --git a/compat/readdir.c b/compat/readdir.c
new file mode 100644
index 0000000..045cfef
--- /dev/null
+++ b/compat/readdir.c
@@ -0,0 +1,26 @@
+#include "../git-compat-util.h"
+#include "../utf8.h"
+
+#undef readdir
+
+static struct dirent temp;
+
+struct dirent *gitreaddir(DIR *dirp)
+{
+ size_t utf8_len;
+ char *utf8;
+ struct dirent *result;
+ result = readdir(dirp);
+ if (result != NULL) {
+ memcpy(&temp, result, sizeof(struct dirent));
+ utf8 = reencode_string(temp.d_name, "UTF8", "UTF8-MAC");
+ if (utf8 != NULL) {
+ utf8_len = strlen(utf8);
+ temp.d_namlen = (u_int8_t) utf8_len;
+ memcpy(temp.d_name, utf8, utf8_len + 1);
+ free(utf8);
+ result = &temp;
+ }
+ }
+ return result;
+}
diff --git a/git-compat-util.h b/git-compat-util.h
index b6ef544..cd0233d 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -202,6 +202,11 @@ void *gitmemmem(const void *haystack, size_t
haystacklen,
const void *needle, size_t needlelen);
#endif
+#ifdef FIX_UTF8_MAC
+#define readdir gitreaddir
+struct dirent ...ARGH! Sorry, I'll try it again later - when I fixed this stupid implicit conversion of TAB to SPC ... Regards, Mark -
Hi, here's the patch again - sent as an attachment. Sorry for any inconvenience. Regards, Mark
I do not know how Macintosh libc implements "struc dirent", but
this approach does not work in general. For example, on Linux
boxes with glibc, "struct dirent" is defined like this (pardon
the funny indentation --- that is from the original):
struct dirent
{
#ifndef __USE_FILE_OFFSET64
__ino_t d_ino;
__off_t d_off;
#else
__ino64_t d_ino;
__off64_t d_off;
#endif
unsigned short int d_reclen;
unsigned char d_type;
char d_name[256]; /* We must not include limits.h! */
};
yet you can obtain a path component longer than 256 bytes.
Apparently the library allocates longer d_name[] field than what
is shown to the user.
-
IMHO there is no need that this approach works in general because this is a fix for MacOSX systems only. I also use d_namlen which might not be This is not a problem either because on MacOSX we get decomposed UTF8 and we always convert to composed UTF8. This means that the string returned from reencode_string will always be smaller than the original filename that had to be reencoded. Regards, Mark -
It is not quite enough that this works Ok on MacOS, if you made
FIX_UTF8_MAC definable in the Makefile. After all some friendly
and helpful Linux folks might want to enable it with their build
trying to help debugging, right?
In the short term, as long as it safely runs without overrunning
the buffer on MacOS, then that is fine, even though we will need
some protection to prevent this code from getting compiled and
used on Linux with glibc, which does have the issue.
I was specifically talking about this "static" thing.
+static struct dirent temp;
+struct dirent *gitreaddir(DIR *dirp)
+{
+ size_t utf8_len;
+ char *utf8;
+ struct dirent *result;
+ result = readdir(dirp);
+ if (result != NULL) {
+ memcpy(&temp, result, sizeof(struct dirent));
+ utf8 = reencode_string(temp.d_name, "UTF8", "UTF8-MAC");
+ if (utf8 != NULL) {
+ utf8_len = strlen(utf8);
+ temp.d_namlen = (u_int8_t) utf8_len;
+ memcpy(temp.d_name, utf8, utf8_len + 1);
+ free(utf8);
+ result = &temp;
+ }
+ }
+ return result;
+}
You memcpy() what the library gave you in *result to the
statically allocated "temp". d_name[] in "temp" comes from the
structure definition in the user visible include file, which
could be much shorter than what the library gave you in *result.
The structure definition I showed in my message you are
responding to illustrates the issue. If MacOS uses a similar
trick to define d_name[256] and sometimes returns much longer
name in *result, you are truncating the name by copying only the
first part of the structure and first 256 bytes of d_name[].
But you have a Mac, I don't, so as long as you have verified
that their header has enough room in statically allocated "temp"
to store longest possible name that can be returned from
readdir(), the code is Ok. I was just being cautious, as I know
the above code has a problem on one platform.
-
Now I understand what you mean. Ok, I'll try to change this and make this work on other platforms too. I didn't know that the readdir function is allowed to return something longer for d_name than the specified length. Regards, Mark
That's not true! There are strings which gets longer when a composing normalization is applied. Please see section 3.3 of Unicode Techical Report 36: http://www.unicode.org/reports/tr36/ > People assume that NFC always composes, and thus is the same or > shorter length than the original source. However, some characters > decompose in NFC. (NFC = Normalization Form Composing.) U+1D160 MUSICAL SYMBOL EIGHT NOTE is given as an example with a 3x expansion factor when encoded in UTF-8 (I don't know what it expands to; seems odd to me.) -hpa -
Individual components are limited to 255 bytes by most filesystems
(PATH_MAX is the whole path, not any individual component).
That said, you're right. It's not really a design requirement, and since
you never get an array of "struct dirent", just a pointer to a single one,
it would be perfectly normal and natural for "struct dirent" to be
declared with a unsized d_name[].
It's also quite possible that some implementations might even have
d_name[] not as an array, but as a pointer to somewhere else (POSIX may
require it to be an array, I didn't check).
That said, I bet that Mark isn't the only one to have written code like
that, so I suspect Mark's code probably works in practice pretty much
everywhere, even if I don't think it's necessarily _required_ to work
correctly.
I do suspect that if you really want to make this portable, and able to
handle an expanding d_name[] too, I think you need to make sure you
allocate a big-enough one. And if you worry about d_name perhaps being a
pointer, that really does mean that you'd need to convert the
system-supplied "struct dirent" into a "git_dirent_t" that you can
control.
That said, I think this patch has a bigger problem, namely just
fundamentally that
char *utf8 = reencode_string(entry, "UTF8", "UTF8-MAC");
is just unbelievably slow. That's just not how it should be done.
First off, the common case is that the filename likely has everything in
plain 7-bit ascii. So rather than re-encoding by default, the first thing
to do is to just see if it even needs re-encoding. Even if it's as simple
as saying "does it have any high bits at all", that's going to be a *huge*
performance win.
So start off with something like
int is_usascii(const char *p)
{
char c;
do {
c = *p++;
} while (c > 0);
return !c;
}
and now you can do
if (is_usascii(entry->d_name))
return entry;
before you even *look* at re-encoding it (and this basically works for all
cases - we really ...Having thought about this some more, I'm starting to suspect that the "readdir()" wrapper thing won't work very well. Yes, it will work on OS X, but for all the wrong reasons. It works there just because of the stupid normalization that OS X does both on filename input and output, so if we hook into readdir() and munge the name there, we'll still be able to use the munged name for lstat() and open(). However, we'll never be able to test it on a sane Unix system, and it won't ever be able to handle the case of a filesystem actually being Latin1 but git being asked to try to transparently convert it to utf-8 in order to work with others. Because most of those readdir() calls will just be fed back not just to the filesystem as lstat() calls later, but also to the recursive directory traversal itself, so if we munge the name, we're also going to screw name lookup. Again, as an OSX-only workaround it's probably acceptable, and perhaps that's the only thing to look at right now. But it does strike me as a design mistake to do it at that level. It would be conceptually nicer to do it in "add_file_to_index()" instead. Ie anything that creates a "struct cache_entry" would do the conversion. So it would be good if somebody looked at what happens if you do the OSX hack in add_file_to_index() instead, and see if it works there.. Linus -
Yes, we would need a reverse conversion when going from index to work tree, including entry.c, in order to be able to emulate this on filesystems that do not take "equivalent" but different names on open(), creat() and lstat(). -
About this size: http://rosenberg.homelinux.net/cgi-bin/gitweb/gitweb.cgi?p=GIT.git;a=commitdiff;h=766d... And messed up expanded tests: http://rosenberg.homelinux.net/cgi-bin/gitweb/gitweb.cgi?p=GIT.git;a=commitdiff;h=5d73... Junio: @Maybe in five years@, you said.. Four more to go. -- robin -
Yes, when I proposed the readdir() wrapper, I meant it to be as OS X specific hack. Just because HFS+ munges names and does that by converting them in the form that is HFS+ specific, we can safely convert then into NFC, as we do not lose more information than it is lost already, and more importantly, AFAIK, everything that a user types on Mac is in NFC, Yes, but that is a separate issue, which unfortunately is much more difficult to deal with. Basically, there are two approaches -- either to wrap all input/output functions, or to find another point where it is possible to convert names without re-writing too much code in Git. It seems to me that the first approach may requires wrapping too much functions, but looking at the code I am not sure that the second will be much easier. There are many places where a filename in the local encoding will interact with Git internal encoding used by repo. If we spoke about Windows only, I would say that the first approach makes much more sense, because all i/o functions used on Windows are already wrappers over Unicode functions. So, converting UTF-8 <-> UTF-16 makes much more sense than UTF-8 <-> some-local-encoding(*) <-> UTF-16. (*) In fact, two different encodings for the same locale setting -- I don't think it is going to work, without changing a lot of code, because filenames entered by user and those that are returned by readdir() are different. Also, .gitignore or .gitattributes files will have filenames in the form that differs from returned by readdir(). Dmitry -
Yes, starting with Mac OS X 10.2 there are functions for that. http://developer.apple.com/qa/qa2001/qa1235.html Anyway, even if iconv is to be used, I believe it should be possible to avoid malloc here (I usually allocate 256 on stack and use malloc()/free() only when I need more than that which in practice never happens!). It is also avoidable to call iconv_open/iconv_close for each name by putting the allocated descriptor for character set conversion into a static variable. Thus leaving iconv() alone, which should not be big overhead provided that it is done only for non-ASCII names. Dmitry -
You need to use "signed char" here. On ARM a char is unsigned by default. That's the case on some other systems too. Nicolas -
Hi, I hate three facts about this patch: - it is too specific to the MacOSX filesystem issues (and better alternatives have _already_ been proposed), - it is a new feature and not a bug fix, very, _very_ late in the rc cycle, - it contains questions in the commit message? WTF? Should it not be marked as PATCH/RFC, possibly without a signoff to make sure that you want to discuss it first? It's possible I am grumpy because everybody and her dog seems to work on her little projects, while I listen to Junio and try to work with/on "master" already since a month. Ciao, Dscho -
Sign-off is about "this is kosher, from licensing point of view" and nothing else. Please do not suggest otherwise. I do not mind discussions during the feature freeze as long as it stays at "feeler" level. -
I know that there were proposed alternatives but I like to use git on It was never meant for inclusion now. I know that this is post-1.5.4 stuff. Regards, Mark -
Hi, In this case, I am going to work on my suggestion myself. Now. Ciao, Dscho -
Hi, I retract that. Although I put some work into it, I agree that it is post 1.5.4. Plus, I do not want anybody to think that shouting and being a PITA buys him anything. Ciao, Dscho -
