Re: [PATCH] Teach "git add" and friends to be paranoid

Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
From: Junio C Hamano
Date: Wednesday, February 17, 2010 - 10:36 pm

Nicolas Pitre <nico@fluxnic.net> writes:


I did a very unscientific test to hash about 14k paths (arch/ and fs/ from
the kernel source) using "git-hash-object -w --stdin-paths" into an empty
repository with varying sizes of paranoia buffer (quarter, 1, 4, 8 and
256kB) and saw 8-30% overhead.  256kB did hurt and around 4kB seemed to be
optimal for my this small sample load.

In any case, with any size of paranoia, this hurts the sane use case, so
I'd introduce an expert switch to disable it, like this.

-- >8 --
When creating a loose object, we normally mmap(2) the entire file, and
hash and then compress to write it out in two separate steps for
efficiency.

This is perfectly good for the intended use of git---nobody is supposed to
be insane enough to expect that it won't break anything to muck with the
contents of a file after telling git to index it and before getting the
control back from git.

But the nature of breakage caused by such an abuse is rather bad.  We will
end up with loose object files, whose names do not match what are stored
and recovered when uncompressed.

This teaches the index_mem() codepath to be paranoid and hash and compress
the data after reading it in core.  The contents hashed may not match the
contents of the file in an insane use case, but at least this way the
result will be internally consistent.

People with saner use of git can regain performance by setting a new
configuration variable 'core.volatilefiles' to false to disable this
check.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config.txt |    7 ++++
 cache.h                  |    1 +
 config.c                 |    6 +++
 environment.c            |    1 +
 sha1_file.c              |   83 +++++++++++++++++++++++++++++++++++++--------
 5 files changed, 83 insertions(+), 15 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 52786c7..0295aee 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -117,6 +117,14 @@ core.fileMode::
 	the working copy are ignored; useful on broken filesystems like FAT.
 	See linkgit:git-update-index[1]. True by default.
 
+core.volatilefiles::
+	If you modify a file after telling git to record it (e.g. with
+	"git add") but before git finishes the request and gives the
+	control back to you, you may create a broken object (and of course
+	you can keep both halves ;-).  Setting this option to true will
+	tell git to be extra careful to detect the situation and abort.
+	Defaults to true.
+
 core.ignoreCygwinFSTricks::
 	This option is only used by Cygwin implementation of Git. If false,
 	the Cygwin stat() and lstat() functions are used. This may be useful
diff --git a/cache.h b/cache.h
index 231c06d..e5a87cf 100644
--- a/cache.h
+++ b/cache.h
@@ -497,6 +497,7 @@ extern int trust_ctime;
 extern int quote_path_fully;
 extern int has_symlinks;
 extern int ignore_case;
+extern int worktree_files_are_volatile;
 extern int assume_unchanged;
 extern int prefer_symlink_refs;
 extern int log_all_ref_updates;
diff --git a/config.c b/config.c
index 790405a..9898041 100644
--- a/config.c
+++ b/config.c
@@ -360,6 +360,12 @@ static int git_default_core_config(const char *var, const char *value)
 		trust_executable_bit = git_config_bool(var, value);
 		return 0;
 	}
+
+	if (!strcmp(var, "core.volatilefiles")) {
+		worktree_files_are_volatile = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.trustctime")) {
 		trust_ctime = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index e278bce..5d0faf3 100644
--- a/environment.c
+++ b/environment.c
@@ -22,6 +22,7 @@ int is_bare_repository_cfg = -1; /* unspecified */
 int log_all_ref_updates = -1; /* unspecified */
 int warn_ambiguous_refs = 1;
 int repository_format_version;
+int worktree_files_are_volatile = 1; /* yuck */
 const char *git_commit_encoding;
 const char *git_log_output_encoding;
 int shared_repository = PERM_UMASK;
diff --git a/sha1_file.c b/sha1_file.c
index 52d1ead..e126179 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2335,14 +2335,18 @@ static int create_tmpfile(char *buffer, size_t bufsiz, const char *filename)
 }
 
 static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
-			      void *buf, unsigned long len, time_t mtime)
+			      void *buf, unsigned long len, time_t mtime,
+			      int paranoid)
 {
 	int fd, size, ret;
 	unsigned char *compressed;
 	z_stream stream;
 	char *filename;
 	static char tmpfile[PATH_MAX];
+	git_SHA_CTX ctx;
 
+	if (!worktree_files_are_volatile)
+		paranoid = 0;
 	filename = sha1_file_name(sha1);
 	fd = create_tmpfile(tmpfile, sizeof(tmpfile), filename);
 	if (fd < 0) {
@@ -2366,12 +2370,41 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	stream.next_in = (unsigned char *)hdr;
 	stream.avail_in = hdrlen;
 	while (deflate(&stream, 0) == Z_OK)
-		/* nothing */;
+		; /* nothing */
 
 	/* Then the data itself.. */
-	stream.next_in = buf;
-	stream.avail_in = len;
-	ret = deflate(&stream, Z_FINISH);
+	if (paranoid) {
+		unsigned char stablebuf[4096];
+		char *bufptr = buf;
+		unsigned long remainder = len;
+
+		git_SHA1_Init(&ctx);
+		git_SHA1_Update(&ctx, hdr, hdrlen);
+
+		ret = Z_OK;
+		while (remainder) {
+			unsigned long chunklen = remainder;
+
+			if (sizeof(stablebuf) <= chunklen)
+				chunklen = sizeof(stablebuf);
+			memcpy(stablebuf, bufptr, chunklen);
+			git_SHA1_Update(&ctx, stablebuf, chunklen);
+			stream.next_in = stablebuf;
+			stream.avail_in = chunklen;
+			do {
+				ret = deflate(&stream, Z_NO_FLUSH);
+			} while (ret == Z_OK);
+			bufptr += chunklen;
+			remainder -= chunklen;
+		}
+		if (ret != Z_STREAM_END)
+			ret = deflate(&stream, Z_FINISH);
+	} else {
+		stream.next_in = buf;
+		stream.avail_in = len;
+		ret = deflate(&stream, Z_FINISH);
+	}
+
 	if (ret != Z_STREAM_END)
 		die("unable to deflate new object %s (%d)", sha1_to_hex(sha1), ret);
 
@@ -2381,6 +2414,12 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 
 	size = stream.total_out;
 
+	if (paranoid) {
+		unsigned char paranoid_sha1[20];
+		git_SHA1_Final(paranoid_sha1, &ctx);
+		if (hashcmp(paranoid_sha1, sha1))
+			die("hashed file is volatile");
+	}
 	if (write_buffer(fd, compressed, size) < 0)
 		die("unable to write sha1 file");
 	close_sha1_file(fd);
@@ -2398,7 +2437,7 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	return move_temp_to_file(tmpfile, filename);
 }
 
-int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *returnsha1)
+static int write_sha1_file_paranoid(void *buf, unsigned long len, const char *type, unsigned char *returnsha1, int paranoid)
 {
 	unsigned char sha1[20];
 	char hdr[32];
@@ -2412,7 +2451,12 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha
 		hashcpy(returnsha1, sha1);
 	if (has_sha1_file(sha1))
 		return 0;
-	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
+	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0, paranoid);
+}
+
+int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *returnsha1)
+{
+	return write_sha1_file_paranoid(buf, len, type, returnsha1, 0);
 }
 
 int force_object_loose(const unsigned char *sha1, time_t mtime)
@@ -2430,7 +2474,7 @@ int force_object_loose(const unsigned char *sha1, time_t mtime)
 	if (!buf)
 		return error("cannot read sha1_file for %s", sha1_to_hex(sha1));
 	hdrlen = sprintf(hdr, "%s %lu", typename(type), len) + 1;
-	ret = write_loose_object(sha1, hdr, hdrlen, buf, len, mtime);
+	ret = write_loose_object(sha1, hdr, hdrlen, buf, len, mtime, 0);
 	free(buf);
 
 	return ret;
@@ -2467,10 +2511,15 @@ int has_sha1_file(const unsigned char *sha1)
 	return has_loose_object(sha1);
 }
 
+#define INDEX_MEM_WRITE_OBJECT  01
+#define INDEX_MEM_PARANOID      02
+
 static int index_mem(unsigned char *sha1, void *buf, size_t size,
-		     int write_object, enum object_type type, const char *path)
+		     enum object_type type, const char *path, int flag)
 {
 	int ret, re_allocated = 0;
+	int write_object = flag & INDEX_MEM_WRITE_OBJECT;
+	int paranoid = flag & INDEX_MEM_PARANOID;
 
 	if (!type)
 		type = OBJ_BLOB;
@@ -2488,9 +2537,11 @@ static int index_mem(unsigned char *sha1, void *buf, size_t size,
 	}
 
 	if (write_object)
-		ret = write_sha1_file(buf, size, typename(type), sha1);
+		ret = write_sha1_file_paranoid(buf, size, typename(type),
+					       sha1, paranoid);
 	else
 		ret = hash_sha1_file(buf, size, typename(type), sha1);
+
 	if (re_allocated)
 		free(buf);
 	return ret;
@@ -2499,23 +2550,25 @@ static int index_mem(unsigned char *sha1, void *buf, size_t size,
 int index_fd(unsigned char *sha1, int fd, struct stat *st, int write_object,
 	     enum object_type type, const char *path)
 {
-	int ret;
+	int ret, flag;
 	size_t size = xsize_t(st->st_size);
 
+	flag = write_object ? INDEX_MEM_WRITE_OBJECT : 0;
 	if (!S_ISREG(st->st_mode)) {
 		struct strbuf sbuf = STRBUF_INIT;
 		if (strbuf_read(&sbuf, fd, 4096) >= 0)
-			ret = index_mem(sha1, sbuf.buf, sbuf.len, write_object,
-					type, path);
+			ret = index_mem(sha1, sbuf.buf, sbuf.len,
+					type, path, flag);
 		else
 			ret = -1;
 		strbuf_release(&sbuf);
 	} else if (size) {
 		void *buf = xmmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
-		ret = index_mem(sha1, buf, size, write_object, type, path);
+		flag |= INDEX_MEM_PARANOID;
+		ret = index_mem(sha1, buf, size, type, path, flag);
 		munmap(buf, size);
 	} else
-		ret = index_mem(sha1, NULL, size, write_object, type, path);
+		ret = index_mem(sha1, NULL, size, type, path, flag);
 	close(fd);
 	return ret;
 }
-- 
1.7.0.81.g58679


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
[PATCH] don't use mmap() to hash files, Dmitry Potapov, (Sat Feb 13, 6:18 pm)
Re: [PATCH] don't use mmap() to hash files, Junio C Hamano, (Sat Feb 13, 6:37 pm)
Re: mmap with MAP_PRIVATE is useless, Junio C Hamano, (Sat Feb 13, 6:53 pm)
Re: [PATCH] don't use mmap() to hash files, Johannes Schindelin, (Sat Feb 13, 6:53 pm)
Re: [PATCH] don't use mmap() to hash files, Junio C Hamano, (Sat Feb 13, 7:00 pm)
Re: mmap with MAP_PRIVATE is useless, Paolo Bonzini, (Sat Feb 13, 7:11 pm)
Re: [PATCH] don't use mmap() to hash files, Dmitry Potapov, (Sat Feb 13, 7:18 pm)
Re: [PATCH] don't use mmap() to hash files, Dmitry Potapov, (Sat Feb 13, 7:42 pm)
[PATCH v2] don't use mmap() to hash files, Dmitry Potapov, (Sat Feb 13, 8:05 pm)
Re: [PATCH] don't use mmap() to hash files, Junio C Hamano, (Sat Feb 13, 8:14 pm)
Re: [PATCH] don't use mmap() to hash files, Jakub Narebski, (Sun Feb 14, 4:07 am)
Re: [PATCH] don't use mmap() to hash files, Thomas Rast, (Sun Feb 14, 4:14 am)
Re: [PATCH] don't use mmap() to hash files, Junio C Hamano, (Sun Feb 14, 4:46 am)
Re: [PATCH] don't use mmap() to hash files, Paolo Bonzini, (Sun Feb 14, 4:55 am)
Re: [PATCH] don't use mmap() to hash files, Johannes Schindelin, (Sun Feb 14, 11:10 am)
Re: [PATCH] don't use mmap() to hash files, Dmitry Potapov, (Sun Feb 14, 12:06 pm)
Re: [PATCH] don't use mmap() to hash files, Johannes Schindelin, (Sun Feb 14, 12:22 pm)
Re: [PATCH] don't use mmap() to hash files, Johannes Schindelin, (Sun Feb 14, 12:28 pm)
Re: [PATCH] don't use mmap() to hash files, Dmitry Potapov, (Sun Feb 14, 12:55 pm)
Re: [PATCH] don't use mmap() to hash files, Dmitry Potapov, (Sun Feb 14, 12:56 pm)
Re: [PATCH] don't use mmap() to hash files, Avery Pennarun, (Sun Feb 14, 4:13 pm)
Re: [PATCH] don't use mmap() to hash files, Zygo Blaxell, (Sun Feb 14, 4:52 pm)
Re: [PATCH] don't use mmap() to hash files, Nicolas Pitre, (Sun Feb 14, 9:16 pm)
Re: [PATCH] don't use mmap() to hash files, Avery Pennarun, (Sun Feb 14, 10:01 pm)
Re: [PATCH] don't use mmap() to hash files, Nicolas Pitre, (Sun Feb 14, 10:05 pm)
Re: [PATCH] don't use mmap() to hash files, Nicolas Pitre, (Sun Feb 14, 10:48 pm)
Re: [PATCH] don't use mmap() to hash files, Paolo Bonzini, (Mon Feb 15, 12:48 am)
Re: [PATCH] don't use mmap() to hash files, Dmitry Potapov, (Mon Feb 15, 5:23 am)
Re: [PATCH] don't use mmap() to hash files, Dmitry Potapov, (Mon Feb 15, 5:25 am)
Re: [PATCH] don't use mmap() to hash files, Avery Pennarun, (Mon Feb 15, 12:19 pm)
Re: [PATCH] don't use mmap() to hash files, Nicolas Pitre, (Mon Feb 15, 12:29 pm)
Re: [PATCH] Teach "git add" and friends to be paranoid, Junio C Hamano, (Wed Feb 17, 10:36 pm)
16 gig, 350,000 file repository, Bill Lear, (Thu Feb 18, 1:11 pm)
Re: 16 gig, 350,000 file repository, Nicolas Pitre, (Thu Feb 18, 1:58 pm)
Re: 16 gig, 350,000 file repository, Erik Faye-Lund, (Fri Feb 19, 2:27 am)
Re: 16 gig, 350,000 file repository, Bill Lear, (Mon Feb 22, 3:20 pm)
Re: 16 gig, 350,000 file repository, Nicolas Pitre, (Mon Feb 22, 3:31 pm)