Re: [PATCH v4 1/3] Avoid conflicts when merging branches with mixed normalization

Previous thread: [PATCH] Clarify that git-cherry-pick applies patches to HEAD by Robert Spanton on Sunday, June 27, 2010 - 6:52 am. (2 messages)

Next thread: What's cooking in git.git (Jun 2010, #05; Sun, 27) by Junio C Hamano on Sunday, June 27, 2010 - 1:11 pm. (5 messages)
From: Eyvind Bernhardsen
Date: Sunday, June 27, 2010 - 12:43 pm

Only a few changes since the last round:

- added some paragraphs to the filter documentation in gitattributes to
  explain why having normalizing filters is a good thing (comments
  welcome)

- fixed the problem Johannes spotted in the eol=crlf optimization by
  expanding CRLFs (ie not optimizing) if a smudge filter is configured

- moved the opening brace in my function definitions to the next line

Thanks for your input.  I'm pretty happy with this now.
-- 
Eyvind

Eyvind Bernhardsen (3):
  Avoid conflicts when merging branches with mixed normalization
  Try normalizing files to avoid delete/modify conflicts when merging
  Don't expand CRLFs when normalizing text during merge

 Documentation/gitattributes.txt |   27 ++++++++++++++++++
 cache.h                         |    1 +
 convert.c                       |   37 +++++++++++++++++++++----
 ll-merge.c                      |   13 +++++++++
 merge-recursive.c               |   44 ++++++++++++++++++++++++++++-
 t/t6038-merge-text-auto.sh      |   58 +++++++++++++++++++++++++++++++++++++++
 6 files changed, 172 insertions(+), 8 deletions(-)
 create mode 100755 t/t6038-merge-text-auto.sh

-- 
1.7.1.575.g383de

--

From: Eyvind Bernhardsen
Date: Sunday, June 27, 2010 - 12:43 pm

Disable CRLF expansion when convert_to_working_tree() is called from
normalize_buffer().  This improves performance when merging branches
with conflicting line endings when core.eol=crlf or core.autocrlf=true
by making the normalization act as if core.eol=lf.

Signed-off-by: Eyvind Bernhardsen <eyvind.bernhardsen@gmail.com>
---
 convert.c |   27 ++++++++++++++++++++-------
 1 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/convert.c b/convert.c
index 0203be8..01de9a8 100644
--- a/convert.c
+++ b/convert.c
@@ -741,7 +741,9 @@ int convert_to_git(const char *path, const char *src, size_t len,
 	return ret | ident_to_git(path, src, len, dst, ident);
 }
 
-int convert_to_working_tree(const char *path, const char *src, size_t len, struct strbuf *dst)
+static int convert_to_working_tree_internal(const char *path, const char *src,
+					    size_t len, struct strbuf *dst,
+					    int normalizing)
 {
 	struct git_attr_check check[5];
 	enum action action = CRLF_GUESS;
@@ -767,18 +769,29 @@ int convert_to_working_tree(const char *path, const char *src, size_t len, struc
 		src = dst->buf;
 		len = dst->len;
 	}
-	action = determine_action(action, eol_attr);
-	ret |= crlf_to_worktree(path, src, len, dst, action);
-	if (ret) {
-		src = dst->buf;
-		len = dst->len;
+	/*
+	 * CRLF conversion can be skipped if normalizing, unless there
+	 * is a smudge filter.  The filter might expect CRLFs.
+	 */
+	if (filter || !normalizing) {
+		action = determine_action(action, eol_attr);
+		ret |= crlf_to_worktree(path, src, len, dst, action);
+		if (ret) {
+			src = dst->buf;
+			len = dst->len;
+		}
 	}
 	return ret | apply_filter(path, src, len, dst, filter);
 }
 
+int convert_to_working_tree(const char *path, const char *src, size_t len, struct strbuf *dst)
+{
+	return convert_to_working_tree_internal(path, src, len, dst, 0);
+}
+
 int renormalize_buffer(const char *path, const char *src, size_t len, struct strbuf *dst)
 {
-	int ret = ...
From: Eyvind Bernhardsen
Date: Sunday, June 27, 2010 - 12:43 pm

Currently, merging across changes in line ending normalization is
painful since files containing CRLF will conflict with normalized files,
even if the only difference between the two versions is the line
endings.  Additionally, any "real" merge conflicts that exist are
obscured because every line in the file has a conflict.

Assume you start out with a repo that has a lot of text files with CRLF
checked in (A):

      o---C
     /     \
    A---B---D

B: Add "* text=auto" to .gitattributes and normalize all files to
   LF-only

C: Modify some of the text files

D: Try to merge C

You will get a ridiculous number of LF/CRLF conflicts when trying to
merge C into D, since the repository contents for C are "wrong" wrt the
new .gitattributes file.

Fix ll-merge so that the "base", "theirs" and "ours" stages are passed
through convert_to_worktree() and convert_to_git() before a three-way
merge.  This ensures that all three stages are normalized in the same
way, removing from consideration differences that are only due to
normalization.

Signed-off-by: Eyvind Bernhardsen <eyvind.bernhardsen@gmail.com>
---
 Documentation/gitattributes.txt |   27 ++++++++++++++++++
 cache.h                         |    1 +
 convert.c                       |   16 +++++++++-
 ll-merge.c                      |   13 +++++++++
 t/t6038-merge-text-auto.sh      |   58 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 113 insertions(+), 2 deletions(-)
 create mode 100755 t/t6038-merge-text-auto.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 564586b..b110082 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -317,6 +317,18 @@ command is "cat").
 	smudge = cat
 ------------------------
 
+For best results, `clean` and `smudge` commands should produce output
+that is not dependent on the corresponding command having been run.
+That is, `clean` should produce identical output whether its input has
+been run ...
From: Finn Arne Gangstad
Date: Monday, June 28, 2010 - 1:02 am

I think this is marginally unclear, what about:

  Clean should not alter its output further if run again
  clean(x) == clean(clean(x))

  Smudge should not alter the output of clean
  clean(x) == clean(smudge(x))

It should not matter that smudge will do something weird if clean
hasn't been run, as long as clean(x) == clean(smudge(x)) still
holds. I also think it is worth mentioning explicitly that clean can

Maybe something about when this happens, or even put it in the header
instead? Something like

  If you have added attributes to a file that cause the canonical
  repository format for that file to change, such as adding a
  clean/smudge filter or text/eol/ident attributes, merging anything based
  on a point in time where the attribute was not in place would normally


- Finn Arne
--

From: Eyvind Bernhardsen
Date: Monday, June 28, 2010 - 12:32 pm

Signed-off-by: Eyvind Bernhardsen <eyvind.bernhardsen@gmail.com>
---
How does this look?

This commit should really be squashed into the first one in the series.
Live and learn, next time I'll add new changes in a new commit and put
it _last_.  I'll submit a fixed series once we're happy with the
documentation.

- Eyvind

 Documentation/gitattributes.txt |   46 ++++++++++++++++++++++-----------------
 1 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index b110082..22400c1 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -317,17 +317,16 @@ command is "cat").
 	smudge = cat
 ------------------------
 
-For best results, `clean` and `smudge` commands should produce output
-that is not dependent on the corresponding command having been run.
-That is, `clean` should produce identical output whether its input has
-been run through `smudge` or not, and `smudge` should not rely on its
-input having been run through `clean`.  See the section on merging
-below for a rationale.
+For best results, `clean` should not alter its output further if it is
+run twice ("clean->clean" should be equivalent to "clean"), and
+multiple `smudge` commands should not alter `clean`'s output
+("smudge->smudge->clean" should be equivalent to "clean").  See the
+section on merging below.
 
-The example "indent" filter is well-behaved in this regard: it will
-accept input that is already correctly indented without modifying it.
-In this case, the lack of a smudge filter means that the clean filter
-_must_ accept its own output without modifying it.
+The "indent" filter is well-behaved in this regard: it will not modify
+input that is already correctly indented.  In this case, the lack of a
+smudge filter means that the clean filter _must_ accept its own output
+without modifying it.
 
 
 Interaction between checkin/checkout attributes
@@ -346,16 +345,23 @@ with `text`, and then ...
From: Finn Arne Gangstad
Date: Monday, June 28, 2010 - 1:31 pm

Looks good I think!

- Finn Arne
--

From: Junio C Hamano
Date: Tuesday, June 29, 2010 - 9:19 am

Looks Ok (I didn't read _this_ patch but read a squashed-in result),

This naturally calls for an optimization idea, doesn't it?

I wonder if ll_merge should gain another flag bit to disable the calls to
normalize_file(), so that the whole thing can be skipped when the caller
somehow knows .gitattributes files that govern the path didn't change.

That won't be a trivial optimization and my gut feeling is that it
shouldn't be part of this series.

I do however wonder if this should be initially introduced as an
experimental feature, guarded with a configuration option for brave souls
to try it out, and flip the feature on by default after we gain confidence
in it, both in performance and in correctness.

-- >8 --
Introduce "double conversion during merge" more gradually

This marks the recent improvement to the merge machinery that helps people
who changed their mind between CRLF/LF an opt in feature, so that we can
more easily release it early to everybody, without fear of breaking the
majority of users (read: on POSIX) that don't need it.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config.txt        |   10 ++++++++++
 Documentation/gitattributes.txt |    5 +++--
 cache.h                         |    1 +
 config.c                        |    5 +++++
 environment.c                   |    1 +
 ll-merge.c                      |    8 +++++---
 6 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 4c49104..ad2a27e 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -538,6 +538,16 @@ core.sparseCheckout::
 	Enable "sparse checkout" feature. See section "Sparse checkout" in
 	linkgit:git-read-tree[1] for more information.
 
+core.doubleConvert::
+	Tell git that canonical representation of files in the repository
+	has changed over time (e.g. earlier commits record text files
+	with CRLF line endings, but recent ones use LF line endings).  In
+	such a ...
From: Eyvind Bernhardsen
Date: Tuesday, June 29, 2010 - 2:18 pm

Agreed, and thanks.  Messing with the merge machinery unnerves me a little.

Shouldn't the normalization in merge-recursive be conditional too?

Something like this squashed into the delete/modify patch:

--8<--
diff --git a/merge-recursive.c b/merge-recursive.c
index a2c174f..49bd3d2 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -1079,6 +1079,8 @@ static int normalized_eq(const unsigned char *a_sha,
 {
        struct strbuf a = STRBUF_INIT;
        struct strbuf b = STRBUF_INIT;
+       if (!core_ll_merge_double_convert)
+               return 0;
        int ret = 0;
        if (a_sha && b_sha &&
            read_sha1_strbuf(a_sha, path, &a) &&
--8<--

Sorry about the whitespace-damaged inline diff, I'll redo the series if necessary.
-- 
Eyvind

--

From: Junio C Hamano
Date: Wednesday, June 30, 2010 - 10:46 am

True, but your patch to merge-recursive is broken, I think.  It should at
least look like the attached rewrite.

Key points:

 - Your "normalized_eq()" is called only from the codepath that wants to
   know if "one is deleted and the other is unchanged" to implement the
   latter half of that check.  Let's name that function blob_unchanged().
   Also let's move the "how to compare two blobs when we are not doing
   double conversion" (i.e. sha_eq()) logic from its caller to it.  These
   changes would make it a lot easier to read the caller.

 - Your read_sha1_strbuf() took "path" as input but didn't do anything
   with it.  Let's drop it.

 - It didn't diagnose any errors.  Let's make sure that it follows the
   usual "0 for success, negative for error" convention and make sure its
   only caller knows about it.

 - "blob_unchanged()" (aka "normalized_eq()") is called only when we have
   two blobs to compare.  It is a programming error to pass NULL in either
   o_sha or a_sha; let's assert() it.

 - Your error path in the function made the caller assume that two input
   matched and it is Ok to continue; it should instead force the caller to
   stop so that the end user can notice.

 - Return values from functions in convert.c are NOT error signals.  A
   true value is to let the caller know that the callee had to convert
   (and a false value is that the callee didn't convert), and the next
   transformation needs to be done on the result in the strbuf (as opposed
   to the src buffer that was left inact).

   In your use of renormalize_buffer(), your input "src" points at the
   same strbuf as your output "dst", so your caller does not care if
   renormalize didn't have to do anything.  Either way, you would pick the
   result up from the strbuf.

   But using the return value to skip the comparison as if the call failed
   is _wrong_.  Even if there is no need for conversion for the given path
   (i.e. 0 return from convert.c functions), you would need to pick up ...
From: Eyvind Bernhardsen
Date: Wednesday, June 30, 2010 - 2:32 pm

Thanks for the detailed review and rewrite!  I unexpectedly had to spend the evening working on non-git related stuff, so I haven't even had time to send my promised re-roll.


Your patch is better than mine, but in my defense I think you misdiagnosed the big problems.  The error path in normalized_eq returned 0 in case of problems, making the caller assume that the files differ and generate a merge conflict, and the return code from normalize_buffer was used correctly (see below).



Yes, absolutely.  The tests are pretty threadbare for such a potentially dangerous change, so I'll cop to a "shiny new toy" charge.  I'll also cop to writing hard-to-read code, but I still think it worked :)


I would rather do this:

	if (sha_eq(o_sha, a_sha))
		return 1;
	if (!core_ll_merge_double_convert)
		return 0;

If the sha1s are equal before normalization the files will be equal after normalization, so there's no sense in going through the rigmarole.


This was one of your points, but I deliberately skipped the memcmp here if _neither_ of the renormalize_buffers did any work (binary "|" instead of boolean "||" to ensure both sides are evaluated, but the comment was perhaps a little obscure).


I'm not a goto objector, just curious: what's the advantage of "goto error_return" here vs. having the renormalizing code inside the if and inverting the test?

Thanks again for taking the time to improve my patch.
-- 
Eyvind

--

From: Junio C Hamano
Date: Wednesday, June 30, 2010 - 8:33 pm

You are absolutely right.  This part of my patch was an unnecessary
pessimization.

Will fix.
--

From: Eyvind Bernhardsen
Date: Wednesday, June 30, 2010 - 1:20 am

My fix to add the configuration option to the delete/modify patch yesterday was pretty bad, sorry.  My only excuse is that I was in a hurry, I'll resend the series tonight with a better fix.
-- 
Eyvind Bernhardsen

--

From: Junio C Hamano
Date: Wednesday, June 30, 2010 - 8:15 am

Nothing that elaborate.

I was envisioning that we would compare object names of the .gitattributes
files in directories that lead to the path being merged in three trees,
and we use the new slowpath unless all three match.  You could look _into_
the actual contents of .gitattributes and decide that a particular change
does not affect the path you are merging, but I don't think it is worth
it; sane people are expected not to flip CRLF/LF around many times a day
anyway, so changes to .gitattributes should already be rare events.

We will be walking the trees in parallel while merging anyway, so when you
have to merge a/b/c.txt, we would already have opened the top-level tree,
tree "a", and tree "a/b" already, and we should be able pick up the object
name of .gitattributes, a/.gitattributes and b/.gitattributes cheaply
without opening any extra object.

--

From: Eyvind Bernhardsen
Date: Sunday, June 27, 2010 - 12:43 pm

If a file is modified due to normalization on one branch, and deleted on
another, a merge of the two branches will result in a delete/modify
conflict for that file even if it is otherwise unchanged.

Try to avoid the conflict by normalizing and comparing the "base" file
and the modified file when their sha1s differ.  If they compare equal,
the file is considered unmodified and is deleted.

Signed-off-by: Eyvind Bernhardsen <eyvind.bernhardsen@gmail.com>
---
 merge-recursive.c          |   44 ++++++++++++++++++++++++++++++++++++++++++--
 t/t6038-merge-text-auto.sh |    2 +-
 2 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index 206c103..f4f09a2 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -1056,6 +1056,44 @@ static unsigned char *stage_sha(const unsigned char *sha, unsigned mode)
 	return (is_null_sha1(sha) || mode == 0) ? NULL: (unsigned char *)sha;
 }
 
+static int read_sha1_strbuf(const unsigned char *sha, const char *path,
+			    struct strbuf *dst)
+{
+	void *buf;
+	enum object_type type;
+	unsigned long size;
+	buf = read_sha1_file(sha, &type, &size);
+	if (!buf)
+		return 0;
+	if (type != OBJ_BLOB) {
+		free(buf);
+		return 0;
+	}
+	strbuf_attach(dst, buf, size, size + 1);
+	return 1;
+}
+
+static int normalized_eq(const unsigned char *a_sha,
+			 const unsigned char *b_sha,
+			 const char *path)
+{
+	struct strbuf a = STRBUF_INIT;
+	struct strbuf b = STRBUF_INIT;
+	int ret = 0;
+	if (a_sha && b_sha &&
+	    read_sha1_strbuf(a_sha, path, &a) &&
+	    read_sha1_strbuf(b_sha, path, &b)) {
+		/* Both files must be normalized, so we can't use || */
+		if ((renormalize_buffer(path, a.buf, a.len, &a) |
+		     renormalize_buffer(path, b.buf, b.len, &b)) &&
+		    (a.len == b.len))
+			ret = memcmp(a.buf, b.buf, a.len) == 0;
+	}
+	strbuf_release(&a);
+	strbuf_release(&b);
+	return ret;
+}
+
 /* Per entry merge function */
 static int process_entry(struct merge_options *o,
 			 ...
Previous thread: [PATCH] Clarify that git-cherry-pick applies patches to HEAD by Robert Spanton on Sunday, June 27, 2010 - 6:52 am. (2 messages)

Next thread: What's cooking in git.git (Jun 2010, #05; Sun, 27) by Junio C Hamano on Sunday, June 27, 2010 - 1:11 pm. (5 messages)