Re: [PU PATCH] Fix git fetch for very large ref counts

Previous thread: Re: Efficiency of initial clone from server by Jakub Narebski on Monday, February 12, 2007 - 5:51 pm. (1 message)

Next thread: stgit "unknown user details" bug? by Adam J. Richter on Monday, February 12, 2007 - 9:34 pm. (6 messages)
From: Julian Phillips
Date: Monday, February 12, 2007 - 6:21 pm

The updated git fetch in pu is vastly improved on repositories with very
large numbers of refs.  The time taken for a no-op fetch over ~9000 refs
drops from ~48m to ~0.5m.

However, before git fetch will actually run on a repository with ~9000
refs the calling interface between fetch and fetch--tool needs to be
changed.  The existing version passes the entire reflist on the command
line, which means that it is subject to the maxiumum environment size
passed to a child process by execve.

The following patches add a stdin based interface to fetch--tool allowing
the ~9000 refs to be passed without exceeding the environment limit.

--
Julian
-

From: Julian Phillips
Date: Monday, February 12, 2007 - 6:21 pm

If the reflist is "-" then read the reflist data from stdin instead,
this will allow the passing of more than 128K of reflist data - which
won't fit in the environment passed by execve.

Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
---
 builtin-fetch--tool.c |   24 ++++++++++++++++++++++--
 1 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/builtin-fetch--tool.c b/builtin-fetch--tool.c
index 3090ffe..619ceb0 100644
--- a/builtin-fetch--tool.c
+++ b/builtin-fetch--tool.c
@@ -2,6 +2,20 @@
 #include "refs.h"
 #include "commit.h"
 
+static char *get_stdin(void)
+{
+#define CHUNK_SIZE (1048576)
+	char *data = xmalloc(CHUNK_SIZE);
+	int offset = 0, read = 0;
+	read = xread(0, data, CHUNK_SIZE);
+	while (read == CHUNK_SIZE) {
+		offset += CHUNK_SIZE;
+		data = xrealloc(data, offset + CHUNK_SIZE);
+		read = xread(0, data + offset, CHUNK_SIZE);
+	}
+	return data;
+}
+
 static void show_new(char *type, unsigned char *sha1_new)
 {
 	fprintf(stderr, "  %s: %s\n", type,
@@ -463,12 +477,18 @@ int cmd_fetch__tool(int argc, const char **argv, const char *prefix)
 	if (!strcmp("parse-reflist", argv[1])) {
 		if (argc != 3)
 			return error("parse-reflist takes 1 arg");
-		return parse_reflist(argv[2]);
+		const char *reflist = argv[2];
+		if (!strcmp(reflist, "-"))
+			reflist = get_stdin();
+		return parse_reflist(reflist);
 	}
 	if (!strcmp("expand-refs-wildcard", argv[1])) {
 		if (argc < 4)
 			return error("expand-refs-wildcard takes at least 2 args");
-		return expand_refs_wildcard(argv[2], argc - 3, argv + 3);
+		const char *reflist = argv[2];
+		if (!strcmp(reflist, "-"))
+			reflist = get_stdin();
+		return expand_refs_wildcard(reflist, argc - 3, argv + 3);
 	}
 
 	return error("Unknown subcommand: %s", argv[1]);
-- 
1.4.4.4

-

From: Julian Phillips
Date: Monday, February 12, 2007 - 6:21 pm

Use the new stdin reflist passing mechanism for the call to
fetch--tool expand-refs-wildcard, allowing passing of more
than ~128K of reflist data.

Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
---
 git-parse-remote.sh |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/git-parse-remote.sh b/git-parse-remote.sh
index 9b19a21..185eb54 100755
--- a/git-parse-remote.sh
+++ b/git-parse-remote.sh
@@ -81,7 +81,7 @@ get_remote_default_refs_for_push () {
 # is to help prevent randomly "globbed" ref from being chosen as
 # a merge candidate
 expand_refs_wildcard () {
-	git fetch--tool expand-refs-wildcard "$ls_remote_result" "$@"
+	echo "$ls_remote_result" | git fetch--tool expand-refs-wildcard "-" "$@"
 }
 
 # Subroutine to canonicalize remote:local notation.
-- 
1.4.4.4

-

From: Julian Phillips
Date: Monday, February 12, 2007 - 6:21 pm

Use the new stdin reflist passing mechanism for the call to
fetch--tool parse-reflist, allowing passing of more than ~128K
of reflist data.

Signed-off-by: Julian Phillips <julian@quantumfyre.co.uk>
---
 git-fetch.sh |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/git-fetch.sh b/git-fetch.sh
index 6a8b196..5a1b4e7 100755
--- a/git-fetch.sh
+++ b/git-fetch.sh
@@ -169,7 +169,7 @@ fi
 
 fetch_native () {
 
-  eval=$(git-fetch--tool parse-reflist "$1")
+  eval=$(echo "$1" | git-fetch--tool parse-reflist "-")
   eval "$eval"
 
     ( : subshell because we muck with IFS
-- 
1.4.4.4

-

From: Linus Torvalds
Date: Monday, February 12, 2007 - 7:31 pm

This pretty much seems to assume that "echo" is a shell built-in.

Otherwise, the

	echo "$1"

part will still fall afoul of any exec limits.

Maybe that's aperfectly fine assumption. It's certainly true for bash (and 
probably _any_ shells that do any built-ins at all - echo is pretty damn 
basic ;)

		Linus
-

From: Junio C Hamano
Date: Monday, February 12, 2007 - 8:18 pm

Thanks.

But the ones in 'pu' were done primarily as demonstration of
where the bottlenecks are, and not meant for real-world
consumption.  I think the final shaving of 0.5m down to a few
seconds needs to move the ls_remote_result string currently kept
as a shell variable to a list of strings represented in a
git-fetch largely rewritten in C, and at that point the
interface from outside fetch--tool to throw 9000 refs at it
would be an internal function call and the code you fixed along
with new function you introduced would probably need to be
discarded.




-

From: Julian Phillips
Date: Tuesday, February 13, 2007 - 3:39 am

And there I was thinking 0.5m was fast ... given how long I've been 
reading this list I really should have know better. ;)

I only really made the changes so I could try your improvements to fetch 
out - if they aren't necessary because you're making it even faster then I 
really don't have much cause to complain.

-- 
Julian

  ---
To be a kind of moral Unix, he touched the hem of Nature's shift.
 		-- Shelley
-

From: Junio C Hamano
Date: Tuesday, February 13, 2007 - 10:58 am

Well, the only thing we need to do in fetching between two
repositories that are already identical should be to compare
ls-remote output from both ends and immediately declare victory,
and it is unacceptable for such an obvious no-op to take 30
seconds.  At this point it really is the shell loop that is

I've applied your fixes to jc/fetch topic and merged it back to
'pu'.  Thanks.


-

Previous thread: Re: Efficiency of initial clone from server by Jakub Narebski on Monday, February 12, 2007 - 5:51 pm. (1 message)

Next thread: stgit "unknown user details" bug? by Adam J. Richter on Monday, February 12, 2007 - 9:34 pm. (6 messages)