Re: [PATCH] Add a new lstat implementation based on Win32 API, and make stat use that implementation too.

Previous thread: [PATCH 0/5] gitweb: Support for arbitrary diffs by Martin Koegler on Sunday, September 2, 2007 - 7:46 am. (7 messages)

Next thread: Re: git-gui i18n status? by Michele Ballabio on Sunday, September 2, 2007 - 8:09 am. (1 message)
From: Marius Storm-Olsen
Subject: Stats in Git
Date: Sunday, September 2, 2007 - 7:49 am

I was checking out the performance situation with Git on Windows, and
found out that the Posix stat functions on Windows are just obscenely
slow. We really can't use them, at least on in Git. So, I made a patch
for the MinGW version, which I'll post right after this mail.

However, while look at that whole stat'ing situation in git, I saw
that doing 'git status' actually stats all the files _thrice_!
Yup, that's not 1 time, or 2 times, but actually 3(!) times before
'git status' is content!
I know that git-status is a script, so I think this clearly indicates
that git-status is a prime candidate for a built-in ;-)

I haven't looked into details as to why it stats the files so many
times. I guess someone more experienced in Git core could give an
opinion, if by writing git-status as a builtin it would be possible to
only stat the files once. It would have a huge impact on Windows where
stats are inheritly much slower than on Linux.

By applying the diff below, you can see for yourself what happens when
you stat the repo created with Moe's script:
    mkdir bummer
    cd bummer
    for ((i=0;i<100;i++)); do
    mkdir $i && pushd $i;
    for ((j=0;j<1000;j++)); do
    echo "$j" >$j; done; popd;
    done

$ git status 2>&1 | wc -l
300137

Fast on Linux now, but still quite slow on Windows..

--
.marius


diff --git a/git-compat-util.h b/git-compat-util.h
index ca0a597..6b6405c 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -369,4 +369,23 @@ static inline int strtoul_ui(char const *s, int base, unsigned int *result)
 	return 0;
 }

+static inline int git_lstat(const char *file_name, struct stat *buf)
+{
+	fprintf(stderr, "lstat: %s\n", file_name);
+	return lstat(file_name, buf);
+}
+static inline int git_fstat(int fd, struct stat *buf)
+{
+	fprintf(stderr, "fstat: %d\n", fd);
+	return fstat(fd, buf);
+}
+static inline int git_stat(const char *file_name, struct stat *buf)
+{
+	fprintf(stderr, "stat: %s\n", file_name);
+	return stat(file_name, ...
From: Marius Storm-Olsen
Date: Sunday, September 2, 2007 - 7:51 am

This gives us a significant speedup when adding, committing and stat'ing files.
(Also, since Windows doesn't really handle symlinks, it's fine that stat just uses lstat)

Signed-off-by: Marius Storm-Olsen <mstormo_git@storm-olsen.com>
---
The following patch will override the normal Posix implementation of stat and
lstat on Windows, and use normal Windows API to ensure we're stat'ing as fast
as possible. With this patch I get the performance increase to the far right.

Initially, I only replaced lstat, since that's what the MinGW port of Git had
implemented from before. But, since we don't really care about symlinks on
Windows, I decided to simply use the same implementation for stat as well. The
performance benefit is clearly indicated in the report below.


 With normal lstat & stat   lstat, based on Win32      lstat & stat, as Win32
 -------------------------  -------------------------  -------------------------
 Command: git init          Command: git init          Command: git init
 -------------------------  -------------------------  -------------------------

 real    0m0.047s           real      0m0.047s         real       0m0.078s
 user    0m0.031s           user      0m0.031s         user       0m0.031s
 sys     0m0.000s           sys       0m0.000s         sys        0m0.000s

 -------------------------  -------------------------  -------------------------
 Command: git add .         Command: git add .         Command: git add .
 -------------------------  -------------------------  -------------------------

 real    0m19.390s          real      0m19.390s        real       0m12.187s
 user    0m0.015s           user      0m0.015s         user       0m0.015s
 sys     0m0.030s           sys       0m0.030s         sys        0m0.015s

 -------------------------  -------------------------  -------------------------
 Command: git commit -a...  Command: git commit -a...  Command: git commit -a...
 -------------------------  -------------------------  ...
From: Marius Storm-Olsen
Date: Sunday, September 2, 2007 - 7:57 am

Sorry, should have added that this is a MinGW port patch only. And I
forgot to include the msysgit mailinglist, nice..

--
.marius

-

From: Reece Dunn
Date: Sunday, September 2, 2007 - 8:32 am

This breaks executable mode reporting for things like configure
scripts and other shell scripts that may, or may not, be executable.
Also, you may want to turn off the executable state for some of these
extensions (for example if com or cmd were not actually executable
files). This makes it impossible to manipulate git repositories
properly on the MinGW platform.

Would it be possible to use the git tree to manage the executable
state? That way, all files would not have their executable state set
by default on Windows. The problem with this is how then to set the
executable state? Having a git version of chmod may not be a good
idea, but then how else are you going to reliably and efficiently
modify the files permissions on Windows?

The rest of the patch looks good on a brief initial scan.

- Reece
-

From: Marius Storm-Olsen
Date: Sunday, September 2, 2007 - 9:09 am

Actually, you don't really need the EXEC bit for Git to work. I just
added it for completeness. (We _could_ remove that too, since it's
slowing us down slightly ;-)

Remember that Git isn't using MSys for its builtins, so MinGW Git
doesn't understand the MSys notion of executable files anyways.
The MinGW port actually peeks at the beginning of a file (ignoring exe
files), and sees if there's an interpreter there. If there is, it will
expand
    git-foo args...
into
    sh git-foo args...
and execute the command. So, it's not really affected by this change.

I haven't had any problems with this patch on my system, so could you
explain what you mean with 'this makes it impossible to manipulate git

The file-state-in-git-tree belongs in a different discussion. Have a
look at the '.gitignore, .gitattributes, .gitmodules, .gitprecious?,
=2Egitacls? etc.' and 'tracking perms/ownership [was: empty directories]'=

threads. Permissions are not a trivial topic, since systems represent
them differently. This patch just tries to reflect the read, write and
execute permissions as normal Windows would; and it only cares about
file extensions (and the PE header, if it exists).

Also note that my patch totally ignores the Group & Others part of the
permission bits. Again, we're on Windows so we don't really care. We
_could_ make it reflect the ACLs in Windows, but then we'd have to make

Thanks

--
=2Emarius

From: Reece Dunn
Date: Sunday, September 2, 2007 - 9:33 am

You pull a repository that contains executable scripts that are
required to work in order to build the system. You then make some
modifications to the local repository and run the 'git add .' command.
Since this patch is reporting executable bits differently, the mode
change is stored as well as the local modifications. Now the changes
are pushed upstream (along with the file mode changes).

Someone running a Linux machine, pulls your changes. When those files
are checked out, the executable state of those scripts has now
changed, preventing the Linux user from running those scripts. _That_

I understand that this is not a trivial topic. I was thinking that
this different behaviour w.r.t. the executable permission will break
things when you have developers on both Linux and Windows, such as the
cairo developers, for current git usage.


Sure. Cygwin does use ACLs to implement stat which is why it is slow.
So anything that can speed git up here, without any breakage in
functionality, is a good thing.

- Reece
-

From: Brian Gernhardt
Date: Sunday, September 2, 2007 - 9:47 am

This is what "git config core.fileMode false" is for.  See git- 
config's man page for information (or Documentation/config.txt).

We already have a way to tell git that the "executable bit" is  
worthless, and any Windows port should use it.

~~ B
-

From: Reece Dunn
Date: Sunday, September 2, 2007 - 9:53 am

Ok, so as the executable bit is worthless, there doesn't need to be
any special casing in this patch to deal with it.

- Reece
-

From: Marius Storm-Olsen
Date: Sunday, September 2, 2007 - 10:05 am

Right, this is true. And I was debating it with myself, and just added
it for completion; at least for the first revision of the patch. It
doesn't really affect the performance all that much anyways. The
conversion of the FileTime to unix time_t is far more heavy. (Which is
why I'm debating to just ignore the access time)

If we could somehow rather use the FileTimes directly in the index,
instead of having to convert them, we could have even better performance
when stat'ing on Windows. (However, it would result in an incompatible
index, so everyone would have to 'git update-index --refresh' on all
repositories before they can use the new version.)

--
=2Emarius

From: Johannes Schindelin
Date: Sunday, September 2, 2007 - 10:44 am

Hi,


Really?  If so, we might consider storing FILETIME->dwHightDateTime and 
->dwLowDateTime in the index.

But I doubt it.  AFAICT _getting_ at the stat data is the expensive thing 
in Windows, not a 64-bit addition, subtraction and division.

Ciao,
Dscho

-

From: Marius Storm-Olsen
Date: Sunday, September 2, 2007 - 11:18 am

Haha, sure sure, _getting_ that stat data in the first place is the
expensive part on Windows. However, that's something you _have_ to do no
matter what, so there's no way around that.

Turns out that it wasn't as bad as i thought. If you have
filetime_to_time_t() just return, say 116444736, I see
    git add . improve with ~0.5 sec for 100K files
and git status improve with 0.05 sec

Surely, avoiding the tripple stat'ing in 'git status' would help a lot
more ;-) So, I guess we'll just leave the timestamp conversion as is,
and avoid complicating the index.

--
=2Emarius

From: Johannes Sixt
Date: Sunday, September 2, 2007 - 11:16 am

Your numbers show an improvement of 50% and more. That is terrific!

I'll test it out an put the patch into mingw.git. I hope you don't mind if I 
also include your analysis and statistics in the commit message. It's worth 
keeping around! BTW, which of your email addresses would you like registered 

I'm slightly negative about this. For a native Windows project the executable 
bit does not matter, and for a cross-platform project this check is not 
sufficient, but can even become annoying (think of a file 

Here's an idea for the future: With this self-made stat() implementation it 
should also be possible to get rid of Windows's native struct stat: Make a 

Of course we need a bit more detailed error conditions, most importantly 

I'd go the short route without git_stat() and

#define stat(x,y) git_lstat(x,y)

-- Hannes
-

From: Marius Storm-Olsen
Date: Sunday, September 2, 2007 - 11:44 am

Yes, I was surprised myself about the impact. Didn't think it would make
_such_ a difference. And if you compare it to the results we had before
Linus' performance fix too, just checkout the performance improvements
we've had (based on Moe's test script):

    Before                        Now
    -------------------------     -------------------------
    Command: git init             Command: git init
    -------------------------     -------------------------
    real    0m0.031s              real       0m0.078s
    user    0m0.031s              user       0m0.031s
    sys     0m0.000s              sys        0m0.000s
    -------------------------     -------------------------
    Command: git add .            Command: git add .
    -------------------------     -------------------------
    real    0m19.328s             real       0m12.187s
    user    0m0.015s              user       0m0.015s
    sys     0m0.015s              sys        0m0.015s
    -------------------------     -------------------------
    Command: git commit -a...     Command: git commit -a...
    -------------------------     -------------------------
    real    0m30.937s             real       0m17.297s
    user    0m0.015s              user       0m0.015s
    sys     0m0.015s              sys        0m0.015s
    -------------------------     -------------------------
    3x Command: git-status        3x Command: git-status
    -------------------------     -------------------------
    real    0m19.531s             real       0m5.344s
    user    0m0.211s              user       0m0.015s
    sys     0m0.136s              sys        0m0.031s

    real    0m19.532s             real       0m5.390s
    user    0m0.259s              user       0m0.031s
    sys     0m0.091s              sys        0m0.000s

    real    0m19.593s             real       0m5.344s
    user    0m0.211s              user       0m0.015s
    sys     0m0.152s              sys        0m0.016s
    -------------------------   ...
From: Johannes Sixt
Date: Sunday, September 2, 2007 - 12:07 pm

Yes, please. Please don't forget to take care of the trailing-slash annoyance.

-- Hannes
-

From: Marius Storm-Olsen
Date: Sunday, September 2, 2007 - 12:31 pm

This gives us a significant speedup when adding, committing and stat'ing files.
(Also, since Windows doesn't really handle symlinks, it's fine that stat just uses lstat)

Signed-off-by: Marius Storm-Olsen <mstormo_git@storm-olsen.com>
---
 Revision #2 of the patch.
 For this one I change the filetime_to_time_t function to do the
 timestamp conversion inline in the FILETIME struct. That way we
 also avoid one assignment, bitshifting and addition.
 Sneaky, huh? ;-)
 New stats:
    -------------------------
    Command: git init
    -------------------------

    real    0m0.047s
    user    0m0.031s
    sys     0m0.000s

    -------------------------
    Command: git add .
    -------------------------

    real    0m12.016s
    user    0m0.015s
    sys     0m0.000s

    -------------------------
    Command: git commit -a...
    -------------------------

    real    0m17.031s
    user    0m0.015s
    sys     0m0.030s

    -------------------------
    3x Command: git-status
    -------------------------

    real    0m5.265s
    user    0m0.015s
    sys     0m0.015s

    real    0m5.297s
    user    0m0.015s
    sys     0m0.000s

    real    0m5.250s
    user    0m0.015s
    sys     0m0.016s

    -------------------------
    Command: git commit...
    (single file)
    -------------------------

    real    0m7.859s
    user    0m0.015s
    sys     0m0.015s


 compat/mingw.c    |   41 +++++++++++++++++++++++++++++++++--------
 git-compat-util.h |    4 ++++
 2 files changed, 37 insertions(+), 8 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index 7711a3f..86a1419 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -23,19 +23,44 @@ int fchmod(int fildes, mode_t mode)
 	return -1;
 }

-int lstat(const char *file_name, struct stat *buf)
+static inline time_t filetime_to_time_t(const FILETIME *ft)
+{
+	long long *winTime = (long long*)ft;
+	*winTime -= 116444736000000000LL; /* Windows to Unix Epoch conversion */
+	*winTime ...
From: Johannes Schindelin
Date: Sunday, September 2, 2007 - 2:26 pm

Hi,


Oh?  *goes and tries to create one on a USB stick* No.  Besides, IIRC you 
cannot even create symlinks to another partition.  Copying a symlink will 
copy the _linked_ file.  So to call this "symlink" is a little... uhm... 
preposterous.

Plus, on a page linked from the link you posted, it says that it is 
only supported from Vista onwards.  So you must be kidding me.

Ciao,
Dscho

-

From: Johannes Schindelin
Date: Sunday, September 2, 2007 - 4:02 pm

Hi,


Like almost every developer in the corporate world?

Fact is: this support of symlinks is ridiculous.  Why not just admit it?

Ciao,
Dscho

-

From: Johannes Sixt
Date: Monday, September 3, 2007 - 12:07 am

And if I understand the documentation correctly, then

$ mkdir foo && cd foo
$ cat ../x
x: No such file or directory

Right?

The docs say that symlinks without backslash are relative to the current 
directory (!!!).

-- Hannes

-

From: Alex Riesen
Date: Sunday, September 2, 2007 - 2:38 pm

Robin Rosenberg, Sun, Sep 02, 2007 22:27:59 +0200:
> s
From: Marius Storm-Olsen
Date: Sunday, September 2, 2007 - 11:15 pm

Yeah, I know about Vista's improved support for symbolic links.
However, I think we can let that lay for a while, until we decide to=20
make Git generate proper symlinks on Vista. I don't see it as a 1st=20
priority at the moment, and we can always add the needed functionality=20
in a separate stat() function later.

--=20
=2Emarius

From: Johannes Schindelin
Date: Monday, September 3, 2007 - 4:39 am

Hi,


... and force everybody to upgrade to Vista, thereby working for Microsoft 
for free?  You _know_ that I will oppose that change.

Ciao,
Dscho

-

From: Marius Storm-Olsen
Date: Monday, September 3, 2007 - 4:53 am

;-) I wouldn't dream of it!
Nah, my comment was more 'allow usage of proper Symlinks on Vista' at=20
a later point. I would still argue that the default would be what we=20
have today. So, it would have to be an option.
But seeing what they've done to the symlinks there, it might be far=20
fetched. We'll worry about that (much) later..

--=20
=2Emarius

From: Johannes Schindelin
Date: Monday, September 3, 2007 - 5:33 am

Hi,




Yes, it is funny how they do it over and over and over again.  Embrace, 
"Extend", Extinguish.  And I thought that eventually people would be 
clever enough to realise...

Ciao,
Dscho

-

From: Alex Riesen
Date: Sunday, September 2, 2007 - 2:41 pm

You sure about that? Ever wondered why it is not so on everywhere else?

-

From: Marius Storm-Olsen
Date: Sunday, September 2, 2007 - 11:12 pm

Pretty sure. If you look at Windows' native version of stat, it will
return you st_ino =3D 0. Or maybe you where referring to something else,
and I just missed your point?

AFAIK, the ino in the index is only to be _really_ sure that nothing
has changed with the file, and we can just skip it on Windows. If in
doubt, try running this on your Windows box:

#include <windows.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int, char **)
{
    wchar_t DirSpec[] =3D L".\\*";
    WIN32_FIND_DATA FindFileData;
    HANDLE hFind =3D FindFirstFile(DirSpec, &FindFileData);
    if (hFind =3D=3D INVALID_HANDLE_VALUE) {
        printf ("Crap happened: %u\n", GetLastError());
        return -1;
    }=20

    struct _stat buf;
    while (FindNextFile(hFind, &FindFileData) !=3D 0)=20
    {
        if (!_wstat(FindFileData.cFileName, &buf))
            printf("file: %S, ino: %u\n", FindFileData.cFileName, buf.st_=
ino);
    }

    FindClose(hFind);
    return 0;
}



--=20
=2Emarius

From: Johannes Sixt
Date: Monday, September 3, 2007 - 12:47 am

Unfortunately, the patch fails t0010-racy-git.sh. I suspect the filetime 

Shouldn't this be 1000000000 according to your comment? However, even if 

-- Hannes

-

From: Marius Storm-Olsen
Date: Monday, September 3, 2007 - 12:55 am

Ok, I'll try to get to it later today.

--=20
=2Emarius

From: Alex Riesen
Date: Sunday, September 2, 2007 - 1:02 pm

just use "strace -e fstat,stat,lstat,stat64,lstat64 -f git-status"
on sane platform.

-

From: Marius Storm-Olsen
Date: Sunday, September 2, 2007 - 1:09 pm

Right, I was doing this on Windows, where strace is rather.. limited, so
I thought I'd just share the code for all to play with :-)

But, on my linux box:
$ strace -e fstat,stat,lstat,stat64,lstat64 -f git-status 2>&1 | wc -l
300195

(Slightly more stats there, as expected, due to the listings of the
shells stats too, and not just the builtins.)

--
=2Emarius

From: Matthieu Moy
Date: Monday, September 3, 2007 - 1:19 am

My experiments show 2 stats only:

$ strace -f git status |& grep -e execve -e  foo                  
[...]
[pid 22492] execve("/home/moy/local/usr//bin/git", ["git", "update-index", "-q", "--unmerged", "--refresh"], [/* 50 vars */]) = 0
[pid 22492] lstat64("foo", {st_mode=S_IFREG|0755, st_size=0, ...}) = 0
[pid 22493] execve("/home/moy/local/usr//bin/git", ["git", "runstatus"], [/* 49 vars */]) = 0
[pid 22493] lstat64("foo", {st_mode=S_IFREG|0755, st_size=0, ...}) = 0
[...]

Once for "git update-index --refresh" and once more for "git
runstatus". Obviously, a builtin with one tree traversal only would
provide a good speedup.

-- 
Matthieu
-

Previous thread: [PATCH 0/5] gitweb: Support for arbitrary diffs by Martin Koegler on Sunday, September 2, 2007 - 7:46 am. (7 messages)

Next thread: Re: git-gui i18n status? by Michele Ballabio on Sunday, September 2, 2007 - 8:09 am. (1 message)