The updated git fetch in pu is vastly improved on repositories with very large numbers of refs. The time taken for a no-op fetch over ~9000 refs drops from ~48m to ~0.5m. However, before git fetch will actually run on a repository with ~9000 refs the calling interface between fetch and fetch--tool needs to be changed. The existing version passes the entire reflist on the command line, which means that it is subject to the maxiumum environment size passed to a child process by execve. The following patches add a stdin based interface to fetch--tool allowing the ~9000 refs to be passed without exceeding the environment limit. -- Julian -
If the reflist is "-" then read the reflist data from stdin instead,
this will allow the passing of more than 128K of reflist data - which
won't fit in the environment passed by execve.
Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
---
builtin-fetch--tool.c | 24 ++++++++++++++++++++++--
1 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/builtin-fetch--tool.c b/builtin-fetch--tool.c
index 3090ffe..619ceb0 100644
--- a/builtin-fetch--tool.c
+++ b/builtin-fetch--tool.c
@@ -2,6 +2,20 @@
#include "refs.h"
#include "commit.h"
+static char *get_stdin(void)
+{
+#define CHUNK_SIZE (1048576)
+ char *data = xmalloc(CHUNK_SIZE);
+ int offset = 0, read = 0;
+ read = xread(0, data, CHUNK_SIZE);
+ while (read == CHUNK_SIZE) {
+ offset += CHUNK_SIZE;
+ data = xrealloc(data, offset + CHUNK_SIZE);
+ read = xread(0, data + offset, CHUNK_SIZE);
+ }
+ return data;
+}
+
static void show_new(char *type, unsigned char *sha1_new)
{
fprintf(stderr, " %s: %s\n", type,
@@ -463,12 +477,18 @@ int cmd_fetch__tool(int argc, const char **argv, const char *prefix)
if (!strcmp("parse-reflist", argv[1])) {
if (argc != 3)
return error("parse-reflist takes 1 arg");
- return parse_reflist(argv[2]);
+ const char *reflist = argv[2];
+ if (!strcmp(reflist, "-"))
+ reflist = get_stdin();
+ return parse_reflist(reflist);
}
if (!strcmp("expand-refs-wildcard", argv[1])) {
if (argc < 4)
return error("expand-refs-wildcard takes at least 2 args");
- return expand_refs_wildcard(argv[2], argc - 3, argv + 3);
+ const char *reflist = argv[2];
+ if (!strcmp(reflist, "-"))
+ reflist = get_stdin();
+ return expand_refs_wildcard(reflist, argc - 3, argv + 3);
}
return error("Unknown subcommand: %s", argv[1]);
--
1.4.4.4
-
Use the new stdin reflist passing mechanism for the call to
fetch--tool expand-refs-wildcard, allowing passing of more
than ~128K of reflist data.
Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
---
git-parse-remote.sh | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/git-parse-remote.sh b/git-parse-remote.sh
index 9b19a21..185eb54 100755
--- a/git-parse-remote.sh
+++ b/git-parse-remote.sh
@@ -81,7 +81,7 @@ get_remote_default_refs_for_push () {
# is to help prevent randomly "globbed" ref from being chosen as
# a merge candidate
expand_refs_wildcard () {
- git fetch--tool expand-refs-wildcard "$ls_remote_result" "$@"
+ echo "$ls_remote_result" | git fetch--tool expand-refs-wildcard "-" "$@"
}
# Subroutine to canonicalize remote:local notation.
--
1.4.4.4
-
Use the new stdin reflist passing mechanism for the call to
fetch--tool parse-reflist, allowing passing of more than ~128K
of reflist data.
Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
---
git-fetch.sh | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/git-fetch.sh b/git-fetch.sh
index 6a8b196..5a1b4e7 100755
--- a/git-fetch.sh
+++ b/git-fetch.sh
@@ -169,7 +169,7 @@ fi
fetch_native () {
- eval=$(git-fetch--tool parse-reflist "$1")
+ eval=$(echo "$1" | git-fetch--tool parse-reflist "-")
eval "$eval"
( : subshell because we muck with IFS
--
1.4.4.4
-
This pretty much seems to assume that "echo" is a shell built-in. Otherwise, the echo "$1" part will still fall afoul of any exec limits. Maybe that's aperfectly fine assumption. It's certainly true for bash (and probably _any_ shells that do any built-ins at all - echo is pretty damn basic ;) Linus -
Thanks. But the ones in 'pu' were done primarily as demonstration of where the bottlenecks are, and not meant for real-world consumption. I think the final shaving of 0.5m down to a few seconds needs to move the ls_remote_result string currently kept as a shell variable to a list of strings represented in a git-fetch largely rewritten in C, and at that point the interface from outside fetch--tool to throw 9000 refs at it would be an internal function call and the code you fixed along with new function you introduced would probably need to be discarded. -
And there I was thinking 0.5m was fast ... given how long I've been reading this list I really should have know better. ;) I only really made the changes so I could try your improvements to fetch out - if they aren't necessary because you're making it even faster then I really don't have much cause to complain. -- Julian --- To be a kind of moral Unix, he touched the hem of Nature's shift. -- Shelley -
Well, the only thing we need to do in fetching between two repositories that are already identical should be to compare ls-remote output from both ends and immediately declare victory, and it is unacceptable for such an obvious no-op to take 30 seconds. At this point it really is the shell loop that is I've applied your fixes to jc/fetch topic and merged it back to 'pu'. Thanks. -
