Only a few changes since the last round: - added some paragraphs to the filter documentation in gitattributes to explain why having normalizing filters is a good thing (comments welcome) - fixed the problem Johannes spotted in the eol=crlf optimization by expanding CRLFs (ie not optimizing) if a smudge filter is configured - moved the opening brace in my function definitions to the next line Thanks for your input. I'm pretty happy with this now. -- Eyvind Eyvind Bernhardsen (3): Avoid conflicts when merging branches with mixed normalization Try normalizing files to avoid delete/modify conflicts when merging Don't expand CRLFs when normalizing text during merge Documentation/gitattributes.txt | 27 ++++++++++++++++++ cache.h | 1 + convert.c | 37 +++++++++++++++++++++---- ll-merge.c | 13 +++++++++ merge-recursive.c | 44 ++++++++++++++++++++++++++++- t/t6038-merge-text-auto.sh | 58 +++++++++++++++++++++++++++++++++++++++ 6 files changed, 172 insertions(+), 8 deletions(-) create mode 100755 t/t6038-merge-text-auto.sh -- 1.7.1.575.g383de --
Disable CRLF expansion when convert_to_working_tree() is called from
normalize_buffer(). This improves performance when merging branches
with conflicting line endings when core.eol=crlf or core.autocrlf=true
by making the normalization act as if core.eol=lf.
Signed-off-by: Eyvind Bernhardsen <eyvind.bernhardsen@gmail.com>
---
convert.c | 27 ++++++++++++++++++++-------
1 files changed, 20 insertions(+), 7 deletions(-)
diff --git a/convert.c b/convert.c
index 0203be8..01de9a8 100644
--- a/convert.c
+++ b/convert.c
@@ -741,7 +741,9 @@ int convert_to_git(const char *path, const char *src, size_t len,
return ret | ident_to_git(path, src, len, dst, ident);
}
-int convert_to_working_tree(const char *path, const char *src, size_t len, struct strbuf *dst)
+static int convert_to_working_tree_internal(const char *path, const char *src,
+ size_t len, struct strbuf *dst,
+ int normalizing)
{
struct git_attr_check check[5];
enum action action = CRLF_GUESS;
@@ -767,18 +769,29 @@ int convert_to_working_tree(const char *path, const char *src, size_t len, struc
src = dst->buf;
len = dst->len;
}
- action = determine_action(action, eol_attr);
- ret |= crlf_to_worktree(path, src, len, dst, action);
- if (ret) {
- src = dst->buf;
- len = dst->len;
+ /*
+ * CRLF conversion can be skipped if normalizing, unless there
+ * is a smudge filter. The filter might expect CRLFs.
+ */
+ if (filter || !normalizing) {
+ action = determine_action(action, eol_attr);
+ ret |= crlf_to_worktree(path, src, len, dst, action);
+ if (ret) {
+ src = dst->buf;
+ len = dst->len;
+ }
}
return ret | apply_filter(path, src, len, dst, filter);
}
+int convert_to_working_tree(const char *path, const char *src, size_t len, struct strbuf *dst)
+{
+ return convert_to_working_tree_internal(path, src, len, dst, 0);
+}
+
int renormalize_buffer(const char *path, const char *src, size_t len, struct strbuf *dst)
{
- int ret = ...Currently, merging across changes in line ending normalization is
painful since files containing CRLF will conflict with normalized files,
even if the only difference between the two versions is the line
endings. Additionally, any "real" merge conflicts that exist are
obscured because every line in the file has a conflict.
Assume you start out with a repo that has a lot of text files with CRLF
checked in (A):
o---C
/ \
A---B---D
B: Add "* text=auto" to .gitattributes and normalize all files to
LF-only
C: Modify some of the text files
D: Try to merge C
You will get a ridiculous number of LF/CRLF conflicts when trying to
merge C into D, since the repository contents for C are "wrong" wrt the
new .gitattributes file.
Fix ll-merge so that the "base", "theirs" and "ours" stages are passed
through convert_to_worktree() and convert_to_git() before a three-way
merge. This ensures that all three stages are normalized in the same
way, removing from consideration differences that are only due to
normalization.
Signed-off-by: Eyvind Bernhardsen <eyvind.bernhardsen@gmail.com>
---
Documentation/gitattributes.txt | 27 ++++++++++++++++++
cache.h | 1 +
convert.c | 16 +++++++++-
ll-merge.c | 13 +++++++++
t/t6038-merge-text-auto.sh | 58 +++++++++++++++++++++++++++++++++++++++
5 files changed, 113 insertions(+), 2 deletions(-)
create mode 100755 t/t6038-merge-text-auto.sh
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 564586b..b110082 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -317,6 +317,18 @@ command is "cat").
smudge = cat
------------------------
+For best results, `clean` and `smudge` commands should produce output
+that is not dependent on the corresponding command having been run.
+That is, `clean` should produce identical output whether its input has
+been run ...I think this is marginally unclear, what about: Clean should not alter its output further if run again clean(x) == clean(clean(x)) Smudge should not alter the output of clean clean(x) == clean(smudge(x)) It should not matter that smudge will do something weird if clean hasn't been run, as long as clean(x) == clean(smudge(x)) still holds. I also think it is worth mentioning explicitly that clean can Maybe something about when this happens, or even put it in the header instead? Something like If you have added attributes to a file that cause the canonical repository format for that file to change, such as adding a clean/smudge filter or text/eol/ident attributes, merging anything based on a point in time where the attribute was not in place would normally - Finn Arne --
Signed-off-by: Eyvind Bernhardsen <eyvind.bernhardsen@gmail.com> --- How does this look? This commit should really be squashed into the first one in the series. Live and learn, next time I'll add new changes in a new commit and put it _last_. I'll submit a fixed series once we're happy with the documentation. - Eyvind Documentation/gitattributes.txt | 46 ++++++++++++++++++++++----------------- 1 files changed, 26 insertions(+), 20 deletions(-) diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt index b110082..22400c1 100644 --- a/Documentation/gitattributes.txt +++ b/Documentation/gitattributes.txt @@ -317,17 +317,16 @@ command is "cat"). smudge = cat ------------------------ -For best results, `clean` and `smudge` commands should produce output -that is not dependent on the corresponding command having been run. -That is, `clean` should produce identical output whether its input has -been run through `smudge` or not, and `smudge` should not rely on its -input having been run through `clean`. See the section on merging -below for a rationale. +For best results, `clean` should not alter its output further if it is +run twice ("clean->clean" should be equivalent to "clean"), and +multiple `smudge` commands should not alter `clean`'s output +("smudge->smudge->clean" should be equivalent to "clean"). See the +section on merging below. -The example "indent" filter is well-behaved in this regard: it will -accept input that is already correctly indented without modifying it. -In this case, the lack of a smudge filter means that the clean filter -_must_ accept its own output without modifying it. +The "indent" filter is well-behaved in this regard: it will not modify +input that is already correctly indented. In this case, the lack of a +smudge filter means that the clean filter _must_ accept its own output +without modifying it. Interaction between checkin/checkout attributes @@ -346,16 +345,23 @@ with `text`, and then ...
Looks Ok (I didn't read _this_ patch but read a squashed-in result), This naturally calls for an optimization idea, doesn't it? I wonder if ll_merge should gain another flag bit to disable the calls to normalize_file(), so that the whole thing can be skipped when the caller somehow knows .gitattributes files that govern the path didn't change. That won't be a trivial optimization and my gut feeling is that it shouldn't be part of this series. I do however wonder if this should be initially introduced as an experimental feature, guarded with a configuration option for brave souls to try it out, and flip the feature on by default after we gain confidence in it, both in performance and in correctness. -- >8 -- Introduce "double conversion during merge" more gradually This marks the recent improvement to the merge machinery that helps people who changed their mind between CRLF/LF an opt in feature, so that we can more easily release it early to everybody, without fear of breaking the majority of users (read: on POSIX) that don't need it. Signed-off-by: Junio C Hamano <gitster@pobox.com> --- Documentation/config.txt | 10 ++++++++++ Documentation/gitattributes.txt | 5 +++-- cache.h | 1 + config.c | 5 +++++ environment.c | 1 + ll-merge.c | 8 +++++--- 6 files changed, 25 insertions(+), 5 deletions(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index 4c49104..ad2a27e 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -538,6 +538,16 @@ core.sparseCheckout:: Enable "sparse checkout" feature. See section "Sparse checkout" in linkgit:git-read-tree[1] for more information. +core.doubleConvert:: + Tell git that canonical representation of files in the repository + has changed over time (e.g. earlier commits record text files + with CRLF line endings, but recent ones use LF line endings). In + such a ...
Agreed, and thanks. Messing with the merge machinery unnerves me a little.
Shouldn't the normalization in merge-recursive be conditional too?
Something like this squashed into the delete/modify patch:
--8<--
diff --git a/merge-recursive.c b/merge-recursive.c
index a2c174f..49bd3d2 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -1079,6 +1079,8 @@ static int normalized_eq(const unsigned char *a_sha,
{
struct strbuf a = STRBUF_INIT;
struct strbuf b = STRBUF_INIT;
+ if (!core_ll_merge_double_convert)
+ return 0;
int ret = 0;
if (a_sha && b_sha &&
read_sha1_strbuf(a_sha, path, &a) &&
--8<--
Sorry about the whitespace-damaged inline diff, I'll redo the series if necessary.
--
Eyvind
--
True, but your patch to merge-recursive is broken, I think. It should at least look like the attached rewrite. Key points: - Your "normalized_eq()" is called only from the codepath that wants to know if "one is deleted and the other is unchanged" to implement the latter half of that check. Let's name that function blob_unchanged(). Also let's move the "how to compare two blobs when we are not doing double conversion" (i.e. sha_eq()) logic from its caller to it. These changes would make it a lot easier to read the caller. - Your read_sha1_strbuf() took "path" as input but didn't do anything with it. Let's drop it. - It didn't diagnose any errors. Let's make sure that it follows the usual "0 for success, negative for error" convention and make sure its only caller knows about it. - "blob_unchanged()" (aka "normalized_eq()") is called only when we have two blobs to compare. It is a programming error to pass NULL in either o_sha or a_sha; let's assert() it. - Your error path in the function made the caller assume that two input matched and it is Ok to continue; it should instead force the caller to stop so that the end user can notice. - Return values from functions in convert.c are NOT error signals. A true value is to let the caller know that the callee had to convert (and a false value is that the callee didn't convert), and the next transformation needs to be done on the result in the strbuf (as opposed to the src buffer that was left inact). In your use of renormalize_buffer(), your input "src" points at the same strbuf as your output "dst", so your caller does not care if renormalize didn't have to do anything. Either way, you would pick the result up from the strbuf. But using the return value to skip the comparison as if the call failed is _wrong_. Even if there is no need for conversion for the given path (i.e. 0 return from convert.c functions), you would need to pick up ...
Thanks for the detailed review and rewrite! I unexpectedly had to spend the evening working on non-git related stuff, so I haven't even had time to send my promised re-roll. Your patch is better than mine, but in my defense I think you misdiagnosed the big problems. The error path in normalized_eq returned 0 in case of problems, making the caller assume that the files differ and generate a merge conflict, and the return code from normalize_buffer was used correctly (see below). Yes, absolutely. The tests are pretty threadbare for such a potentially dangerous change, so I'll cop to a "shiny new toy" charge. I'll also cop to writing hard-to-read code, but I still think it worked :) I would rather do this: if (sha_eq(o_sha, a_sha)) return 1; if (!core_ll_merge_double_convert) return 0; If the sha1s are equal before normalization the files will be equal after normalization, so there's no sense in going through the rigmarole. This was one of your points, but I deliberately skipped the memcmp here if _neither_ of the renormalize_buffers did any work (binary "|" instead of boolean "||" to ensure both sides are evaluated, but the comment was perhaps a little obscure). I'm not a goto objector, just curious: what's the advantage of "goto error_return" here vs. having the renormalizing code inside the if and inverting the test? Thanks again for taking the time to improve my patch. -- Eyvind --
You are absolutely right. This part of my patch was an unnecessary pessimization. Will fix. --
My fix to add the configuration option to the delete/modify patch yesterday was pretty bad, sorry. My only excuse is that I was in a hurry, I'll resend the series tonight with a better fix. -- Eyvind Bernhardsen --
Nothing that elaborate. I was envisioning that we would compare object names of the .gitattributes files in directories that lead to the path being merged in three trees, and we use the new slowpath unless all three match. You could look _into_ the actual contents of .gitattributes and decide that a particular change does not affect the path you are merging, but I don't think it is worth it; sane people are expected not to flip CRLF/LF around many times a day anyway, so changes to .gitattributes should already be rare events. We will be walking the trees in parallel while merging anyway, so when you have to merge a/b/c.txt, we would already have opened the top-level tree, tree "a", and tree "a/b" already, and we should be able pick up the object name of .gitattributes, a/.gitattributes and b/.gitattributes cheaply without opening any extra object. --
If a file is modified due to normalization on one branch, and deleted on
another, a merge of the two branches will result in a delete/modify
conflict for that file even if it is otherwise unchanged.
Try to avoid the conflict by normalizing and comparing the "base" file
and the modified file when their sha1s differ. If they compare equal,
the file is considered unmodified and is deleted.
Signed-off-by: Eyvind Bernhardsen <eyvind.bernhardsen@gmail.com>
---
merge-recursive.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
t/t6038-merge-text-auto.sh | 2 +-
2 files changed, 43 insertions(+), 3 deletions(-)
diff --git a/merge-recursive.c b/merge-recursive.c
index 206c103..f4f09a2 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -1056,6 +1056,44 @@ static unsigned char *stage_sha(const unsigned char *sha, unsigned mode)
return (is_null_sha1(sha) || mode == 0) ? NULL: (unsigned char *)sha;
}
+static int read_sha1_strbuf(const unsigned char *sha, const char *path,
+ struct strbuf *dst)
+{
+ void *buf;
+ enum object_type type;
+ unsigned long size;
+ buf = read_sha1_file(sha, &type, &size);
+ if (!buf)
+ return 0;
+ if (type != OBJ_BLOB) {
+ free(buf);
+ return 0;
+ }
+ strbuf_attach(dst, buf, size, size + 1);
+ return 1;
+}
+
+static int normalized_eq(const unsigned char *a_sha,
+ const unsigned char *b_sha,
+ const char *path)
+{
+ struct strbuf a = STRBUF_INIT;
+ struct strbuf b = STRBUF_INIT;
+ int ret = 0;
+ if (a_sha && b_sha &&
+ read_sha1_strbuf(a_sha, path, &a) &&
+ read_sha1_strbuf(b_sha, path, &b)) {
+ /* Both files must be normalized, so we can't use || */
+ if ((renormalize_buffer(path, a.buf, a.len, &a) |
+ renormalize_buffer(path, b.buf, b.len, &b)) &&
+ (a.len == b.len))
+ ret = memcmp(a.buf, b.buf, a.len) == 0;
+ }
+ strbuf_release(&a);
+ strbuf_release(&b);
+ return ret;
+}
+
/* Per entry merge function */
static int process_entry(struct merge_options *o,
...