I was checking out the performance situation with Git on Windows, and
found out that the Posix stat functions on Windows are just obscenely
slow. We really can't use them, at least on in Git. So, I made a patch
for the MinGW version, which I'll post right after this mail.
However, while look at that whole stat'ing situation in git, I saw
that doing 'git status' actually stats all the files _thrice_!
Yup, that's not 1 time, or 2 times, but actually 3(!) times before
'git status' is content!
I know that git-status is a script, so I think this clearly indicates
that git-status is a prime candidate for a built-in ;-)
I haven't looked into details as to why it stats the files so many
times. I guess someone more experienced in Git core could give an
opinion, if by writing git-status as a builtin it would be possible to
only stat the files once. It would have a huge impact on Windows where
stats are inheritly much slower than on Linux.
By applying the diff below, you can see for yourself what happens when
you stat the repo created with Moe's script:
mkdir bummer
cd bummer
for ((i=0;i<100;i++)); do
mkdir $i && pushd $i;
for ((j=0;j<1000;j++)); do
echo "$j" >$j; done; popd;
done
$ git status 2>&1 | wc -l
300137
Fast on Linux now, but still quite slow on Windows..
--
.marius
diff --git a/git-compat-util.h b/git-compat-util.h
index ca0a597..6b6405c 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -369,4 +369,23 @@ static inline int strtoul_ui(char const *s, int base, unsigned int *result)
return 0;
}
+static inline int git_lstat(const char *file_name, struct stat *buf)
+{
+ fprintf(stderr, "lstat: %s\n", file_name);
+ return lstat(file_name, buf);
+}
+static inline int git_fstat(int fd, struct stat *buf)
+{
+ fprintf(stderr, "fstat: %d\n", fd);
+ return fstat(fd, buf);
+}
+static inline int git_stat(const char *file_name, struct stat *buf)
+{
+ fprintf(stderr, "stat: %s\n", file_name);
+ return stat(file_name, ...This gives us a significant speedup when adding, committing and stat'ing files. (Also, since Windows doesn't really handle symlinks, it's fine that stat just uses lstat) Signed-off-by: Marius Storm-Olsen <mstormo_git@storm-olsen.com> --- The following patch will override the normal Posix implementation of stat and lstat on Windows, and use normal Windows API to ensure we're stat'ing as fast as possible. With this patch I get the performance increase to the far right. Initially, I only replaced lstat, since that's what the MinGW port of Git had implemented from before. But, since we don't really care about symlinks on Windows, I decided to simply use the same implementation for stat as well. The performance benefit is clearly indicated in the report below. With normal lstat & stat lstat, based on Win32 lstat & stat, as Win32 ------------------------- ------------------------- ------------------------- Command: git init Command: git init Command: git init ------------------------- ------------------------- ------------------------- real 0m0.047s real 0m0.047s real 0m0.078s user 0m0.031s user 0m0.031s user 0m0.031s sys 0m0.000s sys 0m0.000s sys 0m0.000s ------------------------- ------------------------- ------------------------- Command: git add . Command: git add . Command: git add . ------------------------- ------------------------- ------------------------- real 0m19.390s real 0m19.390s real 0m12.187s user 0m0.015s user 0m0.015s user 0m0.015s sys 0m0.030s sys 0m0.030s sys 0m0.015s ------------------------- ------------------------- ------------------------- Command: git commit -a... Command: git commit -a... Command: git commit -a... ------------------------- ------------------------- ...
Sorry, should have added that this is a MinGW port patch only. And I forgot to include the msysgit mailinglist, nice.. -- .marius -
This breaks executable mode reporting for things like configure scripts and other shell scripts that may, or may not, be executable. Also, you may want to turn off the executable state for some of these extensions (for example if com or cmd were not actually executable files). This makes it impossible to manipulate git repositories properly on the MinGW platform. Would it be possible to use the git tree to manage the executable state? That way, all files would not have their executable state set by default on Windows. The problem with this is how then to set the executable state? Having a git version of chmod may not be a good idea, but then how else are you going to reliably and efficiently modify the files permissions on Windows? The rest of the patch looks good on a brief initial scan. - Reece -
Actually, you don't really need the EXEC bit for Git to work. I just
added it for completeness. (We _could_ remove that too, since it's
slowing us down slightly ;-)
Remember that Git isn't using MSys for its builtins, so MinGW Git
doesn't understand the MSys notion of executable files anyways.
The MinGW port actually peeks at the beginning of a file (ignoring exe
files), and sees if there's an interpreter there. If there is, it will
expand
git-foo args...
into
sh git-foo args...
and execute the command. So, it's not really affected by this change.
I haven't had any problems with this patch on my system, so could you
explain what you mean with 'this makes it impossible to manipulate git
The file-state-in-git-tree belongs in a different discussion. Have a
look at the '.gitignore, .gitattributes, .gitmodules, .gitprecious?,
=2Egitacls? etc.' and 'tracking perms/ownership [was: empty directories]'=
threads. Permissions are not a trivial topic, since systems represent
them differently. This patch just tries to reflect the read, write and
execute permissions as normal Windows would; and it only cares about
file extensions (and the PE header, if it exists).
Also note that my patch totally ignores the Group & Others part of the
permission bits. Again, we're on Windows so we don't really care. We
_could_ make it reflect the ACLs in Windows, but then we'd have to make
Thanks
--
=2Emarius
You pull a repository that contains executable scripts that are required to work in order to build the system. You then make some modifications to the local repository and run the 'git add .' command. Since this patch is reporting executable bits differently, the mode change is stored as well as the local modifications. Now the changes are pushed upstream (along with the file mode changes). Someone running a Linux machine, pulls your changes. When those files are checked out, the executable state of those scripts has now changed, preventing the Linux user from running those scripts. _That_ I understand that this is not a trivial topic. I was thinking that this different behaviour w.r.t. the executable permission will break things when you have developers on both Linux and Windows, such as the cairo developers, for current git usage. Sure. Cygwin does use ACLs to implement stat which is why it is slow. So anything that can speed git up here, without any breakage in functionality, is a good thing. - Reece -
This is what "git config core.fileMode false" is for. See git- config's man page for information (or Documentation/config.txt). We already have a way to tell git that the "executable bit" is worthless, and any Windows port should use it. ~~ B -
Ok, so as the executable bit is worthless, there doesn't need to be any special casing in this patch to deal with it. - Reece -
Right, this is true. And I was debating it with myself, and just added it for completion; at least for the first revision of the patch. It doesn't really affect the performance all that much anyways. The conversion of the FileTime to unix time_t is far more heavy. (Which is why I'm debating to just ignore the access time) If we could somehow rather use the FileTimes directly in the index, instead of having to convert them, we could have even better performance when stat'ing on Windows. (However, it would result in an incompatible index, so everyone would have to 'git update-index --refresh' on all repositories before they can use the new version.) -- =2Emarius
Hi, Really? If so, we might consider storing FILETIME->dwHightDateTime and ->dwLowDateTime in the index. But I doubt it. AFAICT _getting_ at the stat data is the expensive thing in Windows, not a 64-bit addition, subtraction and division. Ciao, Dscho -
Haha, sure sure, _getting_ that stat data in the first place is the
expensive part on Windows. However, that's something you _have_ to do no
matter what, so there's no way around that.
Turns out that it wasn't as bad as i thought. If you have
filetime_to_time_t() just return, say 116444736, I see
git add . improve with ~0.5 sec for 100K files
and git status improve with 0.05 sec
Surely, avoiding the tripple stat'ing in 'git status' would help a lot
more ;-) So, I guess we'll just leave the timestamp conversion as is,
and avoid complicating the index.
--
=2Emarius
Your numbers show an improvement of 50% and more. That is terrific! I'll test it out an put the patch into mingw.git. I hope you don't mind if I also include your analysis and statistics in the commit message. It's worth keeping around! BTW, which of your email addresses would you like registered I'm slightly negative about this. For a native Windows project the executable bit does not matter, and for a cross-platform project this check is not sufficient, but can even become annoying (think of a file Here's an idea for the future: With this self-made stat() implementation it should also be possible to get rid of Windows's native struct stat: Make a Of course we need a bit more detailed error conditions, most importantly I'd go the short route without git_stat() and #define stat(x,y) git_lstat(x,y) -- Hannes -
Yes, I was surprised myself about the impact. Didn't think it would make
_such_ a difference. And if you compare it to the results we had before
Linus' performance fix too, just checkout the performance improvements
we've had (based on Moe's test script):
Before Now
------------------------- -------------------------
Command: git init Command: git init
------------------------- -------------------------
real 0m0.031s real 0m0.078s
user 0m0.031s user 0m0.031s
sys 0m0.000s sys 0m0.000s
------------------------- -------------------------
Command: git add . Command: git add .
------------------------- -------------------------
real 0m19.328s real 0m12.187s
user 0m0.015s user 0m0.015s
sys 0m0.015s sys 0m0.015s
------------------------- -------------------------
Command: git commit -a... Command: git commit -a...
------------------------- -------------------------
real 0m30.937s real 0m17.297s
user 0m0.015s user 0m0.015s
sys 0m0.015s sys 0m0.015s
------------------------- -------------------------
3x Command: git-status 3x Command: git-status
------------------------- -------------------------
real 0m19.531s real 0m5.344s
user 0m0.211s user 0m0.015s
sys 0m0.136s sys 0m0.031s
real 0m19.532s real 0m5.390s
user 0m0.259s user 0m0.031s
sys 0m0.091s sys 0m0.000s
real 0m19.593s real 0m5.344s
user 0m0.211s user 0m0.015s
sys 0m0.152s sys 0m0.016s
------------------------- ...Yes, please. Please don't forget to take care of the trailing-slash annoyance. -- Hannes -
This gives us a significant speedup when adding, committing and stat'ing files.
(Also, since Windows doesn't really handle symlinks, it's fine that stat just uses lstat)
Signed-off-by: Marius Storm-Olsen <mstormo_git@storm-olsen.com>
---
Revision #2 of the patch.
For this one I change the filetime_to_time_t function to do the
timestamp conversion inline in the FILETIME struct. That way we
also avoid one assignment, bitshifting and addition.
Sneaky, huh? ;-)
New stats:
-------------------------
Command: git init
-------------------------
real 0m0.047s
user 0m0.031s
sys 0m0.000s
-------------------------
Command: git add .
-------------------------
real 0m12.016s
user 0m0.015s
sys 0m0.000s
-------------------------
Command: git commit -a...
-------------------------
real 0m17.031s
user 0m0.015s
sys 0m0.030s
-------------------------
3x Command: git-status
-------------------------
real 0m5.265s
user 0m0.015s
sys 0m0.015s
real 0m5.297s
user 0m0.015s
sys 0m0.000s
real 0m5.250s
user 0m0.015s
sys 0m0.016s
-------------------------
Command: git commit...
(single file)
-------------------------
real 0m7.859s
user 0m0.015s
sys 0m0.015s
compat/mingw.c | 41 +++++++++++++++++++++++++++++++++--------
git-compat-util.h | 4 ++++
2 files changed, 37 insertions(+), 8 deletions(-)
diff --git a/compat/mingw.c b/compat/mingw.c
index 7711a3f..86a1419 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -23,19 +23,44 @@ int fchmod(int fildes, mode_t mode)
return -1;
}
-int lstat(const char *file_name, struct stat *buf)
+static inline time_t filetime_to_time_t(const FILETIME *ft)
+{
+ long long *winTime = (long long*)ft;
+ *winTime -= 116444736000000000LL; /* Windows to Unix Epoch conversion */
+ *winTime ...Hi, Oh? *goes and tries to create one on a USB stick* No. Besides, IIRC you cannot even create symlinks to another partition. Copying a symlink will copy the _linked_ file. So to call this "symlink" is a little... uhm... preposterous. Plus, on a page linked from the link you posted, it says that it is only supported from Vista onwards. So you must be kidding me. Ciao, Dscho -
Hi, Like almost every developer in the corporate world? Fact is: this support of symlinks is ridiculous. Why not just admit it? Ciao, Dscho -
And if I understand the documentation correctly, then $ mkdir foo && cd foo $ cat ../x x: No such file or directory Right? The docs say that symlinks without backslash are relative to the current directory (!!!). -- Hannes -
Robin Rosenberg, Sun, Sep 02, 2007 22:27:59 +0200: > s
Yeah, I know about Vista's improved support for symbolic links. However, I think we can let that lay for a while, until we decide to=20 make Git generate proper symlinks on Vista. I don't see it as a 1st=20 priority at the moment, and we can always add the needed functionality=20 in a separate stat() function later. --=20 =2Emarius
Hi, ... and force everybody to upgrade to Vista, thereby working for Microsoft for free? You _know_ that I will oppose that change. Ciao, Dscho -
;-) I wouldn't dream of it! Nah, my comment was more 'allow usage of proper Symlinks on Vista' at=20 a later point. I would still argue that the default would be what we=20 have today. So, it would have to be an option. But seeing what they've done to the symlinks there, it might be far=20 fetched. We'll worry about that (much) later.. --=20 =2Emarius
Hi, Yes, it is funny how they do it over and over and over again. Embrace, "Extend", Extinguish. And I thought that eventually people would be clever enough to realise... Ciao, Dscho -
You sure about that? Ever wondered why it is not so on everywhere else? -
Pretty sure. If you look at Windows' native version of stat, it will
return you st_ino =3D 0. Or maybe you where referring to something else,
and I just missed your point?
AFAIK, the ino in the index is only to be _really_ sure that nothing
has changed with the file, and we can just skip it on Windows. If in
doubt, try running this on your Windows box:
#include <windows.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
int main(int, char **)
{
wchar_t DirSpec[] =3D L".\\*";
WIN32_FIND_DATA FindFileData;
HANDLE hFind =3D FindFirstFile(DirSpec, &FindFileData);
if (hFind =3D=3D INVALID_HANDLE_VALUE) {
printf ("Crap happened: %u\n", GetLastError());
return -1;
}=20
struct _stat buf;
while (FindNextFile(hFind, &FindFileData) !=3D 0)=20
{
if (!_wstat(FindFileData.cFileName, &buf))
printf("file: %S, ino: %u\n", FindFileData.cFileName, buf.st_=
ino);
}
FindClose(hFind);
return 0;
}
--=20
=2Emarius
Unfortunately, the patch fails t0010-racy-git.sh. I suspect the filetime Shouldn't this be 1000000000 according to your comment? However, even if -- Hannes -
Ok, I'll try to get to it later today. --=20 =2Emarius
Right, I was doing this on Windows, where strace is rather.. limited, so I thought I'd just share the code for all to play with :-) But, on my linux box: $ strace -e fstat,stat,lstat,stat64,lstat64 -f git-status 2>&1 | wc -l 300195 (Slightly more stats there, as expected, due to the listings of the shells stats too, and not just the builtins.) -- =2Emarius
My experiments show 2 stats only:
$ strace -f git status |& grep -e execve -e foo
[...]
[pid 22492] execve("/home/moy/local/usr//bin/git", ["git", "update-index", "-q", "--unmerged", "--refresh"], [/* 50 vars */]) = 0
[pid 22492] lstat64("foo", {st_mode=S_IFREG|0755, st_size=0, ...}) = 0
[pid 22493] execve("/home/moy/local/usr//bin/git", ["git", "runstatus"], [/* 49 vars */]) = 0
[pid 22493] lstat64("foo", {st_mode=S_IFREG|0755, st_size=0, ...}) = 0
[...]
Once for "git update-index --refresh" and once more for "git
runstatus". Obviously, a builtin with one tree traversal only would
provide a good speedup.
--
Matthieu
-
