Re: [PATCH 0/4] --word-regex sanity checking and such

Previous thread: [PATCH v6.1 0/8] git log -L, cleaned up and (hopefully) fixed by Thomas Rast on Tuesday, December 14, 2010 - 3:54 pm. (10 messages)

Next thread: When I merge, a binary file conflict,how can I select between 2 versions? by Chunlin Zhang on Wednesday, December 15, 2010 - 1:00 am. (3 messages)
From: Scott Johnson
Date: Tuesday, December 14, 2010 - 8:47 pm

I am attempting to do a word diff of an html source file. Part of the removed 
html is disappearing from the diff when I enable the fancy html word diff.

Here's the output from basic `git diff`:
diff --git a/adv_layout_source.html b/adv_layout_source.html
index 18a81dd..c4ed609 100644
--- a/adv_layout_source.html
+++ b/adv_layout_source.html
@@ -42,8 +42,8 @@
       <ul>
         <li class="ydn-patterns"><em></em><a href="#">ydn-patterns</a></li>
         <li class="ydn-mail"><em></em><a href="#">ydn-mail</a></li>
-        <li class="yws-maps"><em></em><a href="#">yws-maps</a></li>
-        <li class="ydn-delicious"><em></em><a href="#">ydn-delicious</a></li>
+        <li><em></em><a href="#">yws-maps</a></li>
+        <li><em></em><a href="#">ydn-delicious</a></li>
         <li class="yws-flickr"><em></em><a href="#">yws-flickr</a></li>
         <li class="yws-events"><em></em><a href="#">yws-events</a></li>
       </ul>


Here's the default `git diff --word-diff`:
diff --git a/adv_layout_source.html b/adv_layout_source.html
index 18a81dd..c4ed609 100644
--- a/adv_layout_source.html
+++ b/adv_layout_source.html
@@ -42,8 +42,8 @@
      <ul>
        <li class="ydn-patterns"><em></em><a href="#">ydn-patterns</a></li>
        <li class="ydn-mail"><em></em><a href="#">ydn-mail</a></li>
        [-<li class="yws-maps"><em></em><a-]{+<li><em></em><a+} 
href="#">yws-maps</a></li>
        [-<li class="ydn-delicious"><em></em><a-]{+<li><em></em><a+} 
href="#">ydn-delicious</a></li>
        <li class="yws-flickr"><em></em><a href="#">yws-flickr</a></li>
        <li class="yws-events"><em></em><a href="#">yws-events</a></li>
      </ul>

Which is correct, but less than ideal because it highlights much more than the 
actual changes.

So I create a .gitattributes file with one line:
*.html diff=html

And rerun `git diff --word-diff`:
diff --git a/adv_layout_source.html b/adv_layout_source.html
index 18a81dd..c4ed609 100644
--- a/adv_layout_source.html
+++ ...
From: Michael J Gruber
Date: Wednesday, December 15, 2010 - 2:06 am

The wordRegex should really only control what comprises a word, i.e. the
granularity of --word-diff. (Where do we insert additional line-breaks
before running ordinary diff?)

If a wordRegex can make parts of diff disappear than there is problem
deeper in the diff machinery. Can you trim this down to a minimal example?

Michael
--

From: Matthijs Kooijman
Date: Wednesday, December 15, 2010 - 2:12 am

It can do exactly that. The word regex determines what is a word, but
everything else is counted as "whitespace". The word diff view shows
only differences in words, not in whitespace (which is intentional,
since whitespace changes in things like LaTeX or HTML are not
interesting). Note that it doesn't show whitespace _differences_, but it
does show the whitespace itself (taken from the "new" version of the
file).


So, if the word regex somehow doesn't match the second line at all (or
That would be useful in any case.

Gr.

Matthijs
From: Michael J Gruber
Date: Wednesday, December 15, 2010 - 2:29 am

Yep, I just found out myself experimenting with a wordRegex for csv.
Seems like quite a "Gimme rope" feature...


What strikes me is that both lines are semantically identical, yet one
is treated correctly and the other isn't.

Michael
--

From: Thomas Rast
Date: Wednesday, December 15, 2010 - 8:13 am

[Forgot the list and Matthijs on the first sending.  Sorry for the
spam!]



Well. Yes. No. Maybe.

Thanks for bringing this to my attention.  I currently have enough
more serious work to avoid that this actually motivated me to hack up
a sanity check.  It's just far too error prone as it is now.

But I cannot reproduce the problem!  I put Scott's two offending lines
(taken from his "straight" diff) into t4034/html/{pre,post}, and I
think the output is valid.  Also, the word regex for html has long
included the |[^[:space:]] safeguard (actually they all do except for
bibtex, which is even more lenient on what it matches).  So you either
found an example that depends on more context (which would be *really*
bad) or there is another source of bad regexes.  Anyway, the safeguard
should easily catch the latter case.

This did unearth a bug in the ruby regex, though, so it's been worth
the trouble.

Various small issues with this patch series:

* [4/4] I stole the html test from Scott's mail, and some of the rest
  from various Wikibooks sources on "Hello World" in each language,
  usually extended by a bit of code that tests the world-splitting
  power.  I hope this is ok with Scott and the Copyright overlords.
  There are only so many ways to spell "Hello World", and only so many
  languages I know myself.

* [4/4] Many patterns do not split 1+2, probably because they stick +2
  together as a signed integer literal, even though I think they
  should.  I ran out of time to investigate however.

* [3/4] was actually detected with the help of [4/4], but putting it
  after would require heavy special casing.

* [2/4] It's a weird idiosyncrasy of the word-diff code that the exit
  status of git-diff does not depend on whether word-diff found any
  differences, and in fact the shown hunks do not either.  So the
  tests are "test_must_fail" regardless of word regex, because the
  input files differ at a byte level.  Maybe at least hunks without
  word differences should be ...
From: Thomas Rast
Date: Wednesday, December 15, 2010 - 8:13 am

The regex had an unclosed ] that pretty much ruined the safeguard
against not matching a non-space char.

Signed-off-by: Thomas Rast <trast@student.ethz.ch>
---
 userdiff.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/userdiff.c b/userdiff.c
index f9e05b5..4d6433b 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -81,7 +81,7 @@
 	 "(@|@@|\\$)?[a-zA-Z_][a-zA-Z0-9_]*"
 	 "|[-+0-9.e]+|0[xXbB]?[0-9a-fA-F]+|\\?(\\\\C-)?(\\\\M-)?."
 	 "|//=?|[-+*/<>%&^|=!]=|<<=?|>>=?|===|\\.{1,3}|::|[!=]~"
-	 "|[^[:space:]|[\x80-\xff]+"),
+	 "|[^[:space:]]|[\x80-\xff]+"),
 PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
 	 "[={}\"]|[^={}\" \t]+"),
 PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
-- 
1.7.3.3.807.g6ee1f

--

From: Thomas Rast
Date: Wednesday, December 15, 2010 - 8:13 am

The builtin word regexes should be tested with some simple examples
against simple issues, like failing to match a non-space character.

Do this in bulk.  Many of these patterns are a rather ad-hoc
combination of a few simple lines of code, so they can certainly be
improved.  However, they already unearthed a typo in the ruby pattern
(previous commit).

Signed-off-by: Thomas Rast <trast@student.ethz.ch>
---
 t/t4034-diff-words.sh  |   20 ++++++++++++++++++++
 t/t4034/bibtex/expect  |   15 +++++++++++++++
 t/t4034/bibtex/post    |   10 ++++++++++
 t/t4034/bibtex/pre     |    9 +++++++++
 t/t4034/cpp/expect     |   10 ++++++++++
 t/t4034/cpp/post       |    5 +++++
 t/t4034/cpp/pre        |    4 ++++
 t/t4034/csharp/expect  |   12 ++++++++++++
 t/t4034/csharp/post    |    8 ++++++++
 t/t4034/csharp/pre     |    8 ++++++++
 t/t4034/fortran/expect |   12 ++++++++++++
 t/t4034/fortran/post   |    7 +++++++
 t/t4034/fortran/pre    |    7 +++++++
 t/t4034/html/expect    |    7 +++++++
 t/t4034/html/post      |    2 ++
 t/t4034/html/pre       |    2 ++
 t/t4034/java/expect    |   11 +++++++++++
 t/t4034/java/post      |    6 ++++++
 t/t4034/java/pre       |    6 ++++++
 t/t4034/objc/expect    |   11 +++++++++++
 t/t4034/objc/post      |    7 +++++++
 t/t4034/objc/pre       |    7 +++++++
 t/t4034/pascal/expect  |   12 ++++++++++++
 t/t4034/pascal/post    |    7 +++++++
 t/t4034/pascal/pre     |    7 +++++++
 t/t4034/php/expect     |    7 +++++++
 t/t4034/php/post       |    2 ++
 t/t4034/php/pre        |    2 ++
 t/t4034/python/expect  |    8 ++++++++
 t/t4034/python/post    |    3 +++
 t/t4034/python/pre     |    3 +++
 t/t4034/ruby/expect    |    7 +++++++
 t/t4034/ruby/post      |    2 ++
 t/t4034/ruby/pre       |    2 ++
 t/t4034/tex/expect     |    9 +++++++++
 t/t4034/tex/post       |    4 ++++
 t/t4034/tex/pre        |    4 ++++
 37 files changed, 265 insertions(+), 0 deletions(-)
 create mode 100644 t/t4034/bibtex/expect
 create mode 100644 ...
From: Thomas Rast
Date: Wednesday, December 15, 2010 - 12:51 pm

I can't reproduce.  I did this:

  $ ls -l
  total 16
  -rw-r--r-- 1 thomas users 2128 2010-12-15 20:42 post.html
  -rw-r--r-- 1 thomas users 2354 2010-12-15 20:42 pre.html
  $ echo '*.html diff=html'  >.gitattributes
  $ git diff --no-index pre.html post.html
  diff --git 1/pre.html 2/post.html
[...]
  -        <li class="ydn-patterns"><em></em><a href="#">ydn-patterns</a></li>
  -        <li class="ydn-mail"><em></em><a href="#">ydn-mail</a></li>
  -        <li class="yws-maps"><em></em><a href="#">yws-maps</a></li>
  -        <li class="ydn-delicious"><em></em><a href="#">ydn-delicious</a></li>
  -        <li class="yws-flickr"><em></em><a href="#">yws-flickr</a></li>
  -        <li class="yws-events"><em></em><a href="#">yws-events</a></li>
  +        <li><em></em><a href="#">ydn-patterns</a></li>
  +        <li><em></em><a href="#">ydn-mail</a></li>
  +        <li><em></em><a href="#">yws-maps</a></li>
  +        <li><em></em><a href="#">ydn-delicious</a></li>
  +        <li><em></em><a href="#">yws-flickr</a></li>
  +        <li><em></em><a href="#">yws-events</a></li>
         </ul>
       </div><!-- wrap -->
     </div><!-- folder_list -->
  $ git diff --word-diff --no-index pre.html post.html
  diff --git 1/pre.html 2/post.html
[...]
          <li[-class="ydn-patterns"-]><em></em><a href="#">ydn-patterns</a></li>
          <li[-class="ydn-mail"-]><em></em><a href="#">ydn-mail</a></li>
          <li[-class="yws-maps"-]><em></em><a href="#">yws-maps</a></li>
          <li[-class="ydn-delicious"-]><em></em><a href="#">ydn-delicious</a></li>
          <li[-class="yws-flickr"-]><em></em><a href="#">yws-flickr</a></li>
          <li[-class="yws-events"-]><em></em><a href="#">yws-events</a></li>
        </ul>
      </div><!-- wrap -->
    </div><!-- folder_list -->

That's running bleeding-edge git, like I always do, but the userdiff
stuff hasn't changed in ages.

What does

  git config diff.html.wordregex

say on your system?

-- 
Thomas ...
From: Scott Johnson
Date: Wednesday, December 15, 2010 - 1:48 pm

Turns out to be system-dependent. I built v1.7.3.3 from source on three 
different boxes and only one of them is broken.


The /etc/redhat-release shows:

Broken:
Fedora Core release 6 (Zod)

Correct:
Red Hat Enterprise Linux WS release 4 (Nahant Update 6)
Fedora release 9 (Sulphur)

So I guess that means the problem is in some library that has most likely been 
fixed since Fedora 6.



----- Original Message ----
From: Thomas Rast <trast@student.ethz.ch>
To: Scott Johnson <scottj75074@yahoo.com>
Cc: Michael J Gruber <git@drmicha.warpmail.net>; Matthijs Kooijman 
<matthijs@stdin.nl>; git@vger.kernel.org
Sent: Wed, December 15, 2010 11:51:14 AM
Subject: Re: [PATCH 0/4] --word-regex sanity checking and such


I can't reproduce.  I did this:

  $ ls -l
  total 16
  -rw-r--r-- 1 thomas users 2128 2010-12-15 20:42 post.html
  -rw-r--r-- 1 thomas users 2354 2010-12-15 20:42 pre.html
  $ echo '*.html diff=html'  >.gitattributes
  $ git diff --no-index pre.html post.html
  diff --git 1/pre.html 2/post.html
[...]
  -        <li class="ydn-patterns"><em></em><a href="#">ydn-patterns</a></li>
  -        <li class="ydn-mail"><em></em><a href="#">ydn-mail</a></li>
  -        <li class="yws-maps"><em></em><a href="#">yws-maps</a></li>
  -        <li class="ydn-delicious"><em></em><a href="#">ydn-delicious</a></li>
  -        <li class="yws-flickr"><em></em><a href="#">yws-flickr</a></li>
  -        <li class="yws-events"><em></em><a href="#">yws-events</a></li>
  +        <li><em></em><a href="#">ydn-patterns</a></li>
  +        <li><em></em><a href="#">ydn-mail</a></li>
  +        <li><em></em><a href="#">yws-maps</a></li>
  +        <li><em></em><a href="#">ydn-delicious</a></li>
  +        <li><em></em><a href="#">yws-flickr</a></li>
  +        <li><em></em><a href="#">yws-events</a></li>
         </ul>
       </div><!-- wrap -->
     </div><!-- folder_list -->
  $ git diff --word-diff --no-index pre.html post.html
  diff --git 1/pre.html 2/post.html
[...]
          ...
From: Thomas Rast
Date: Saturday, December 18, 2010 - 9:17 am

I decided to play it safe, and removed these parts.  In addition I
extended the bulk tests to a C operator table and some forms of
literals for the C-style languages so as to catch issues with
non-matches.  This showed that the python regex had the same typo as
the ruby one. *blush*

Scott's problem still remains:


I briefly considered installing FC6 on a VM, but my VirtualBox is
broken and I'm having a hard time finding a FC6 installation medium.
Right now the only other systems I have are darwin 10.5 and RHEL5.5,
and the test works on both.

So in the absence of any way of testing this, someone with a breaking
system will have to investigate.  I think it's worth including the
series anyway, since the regexes give wrong results in the case of
match failures, and we would want users to know about this.


Thomas Rast (4):
  diff.c: pass struct diff_words into find_word_boundaries
  diff.c: implement a sanity check for word regexes
  userdiff: fix typo in ruby and python word regexes
  t4034: bulk verify builtin word regex sanity

 Documentation/config.txt |    8 ++++
 diff.c                   |  104 +++++++++++++++++++++++++++++++++++++++++----
 diff.h                   |    1 +
 t/t4034-diff-words.sh    |   85 +++++++++++++++++++++++++++++++++++++-
 t/t4034/bibtex/expect    |   15 +++++++
 t/t4034/bibtex/post      |   10 ++++
 t/t4034/bibtex/pre       |    9 ++++
 t/t4034/cpp/expect       |   36 ++++++++++++++++
 t/t4034/cpp/post         |   19 ++++++++
 t/t4034/cpp/pre          |   19 ++++++++
 t/t4034/csharp/expect    |   35 +++++++++++++++
 t/t4034/csharp/post      |   18 ++++++++
 t/t4034/csharp/pre       |   18 ++++++++
 t/t4034/fortran/expect   |   10 ++++
 t/t4034/fortran/post     |    5 ++
 t/t4034/fortran/pre      |    5 ++
 t/t4034/html/expect      |    8 ++++
 t/t4034/html/post        |    3 +
 t/t4034/html/pre         |    3 +
 t/t4034/java/expect      |   36 ++++++++++++++++
 t/t4034/java/post        |   19 ++++++++
 t/t4034/java/pre         |   ...
From: Thomas Rast
Date: Saturday, December 18, 2010 - 9:17 am

Word regexes are a bit of a dangerous beast, since it is easily
possible to not match a non-space part, which is subsequently ignored
for the purposes of emitting the word diff.  This was clearly stated
in the docs, but users still tripped over it.

Implement a safeguard that verifies two basic sanity assumptions:

* The word regex matches anything that is !isspace().

* The word regex does not match '\n'.  (This case is not very harmful,
  but we used to silently cut off at the '\n' which may go against
  user expectations.)

This is configurable via 'diff.wordRegexCheck', and defaults to
'warn'.

Reported-by: Scott Johnson <scottj75074@yahoo.com>
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
---
 Documentation/config.txt |    8 ++++
 diff.c                   |   93 +++++++++++++++++++++++++++++++++++++++++++--
 diff.h                   |    1 +
 t/t4034-diff-words.sh    |   65 +++++++++++++++++++++++++++++++-
 4 files changed, 161 insertions(+), 6 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index bf9479e..2e033ea 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -897,6 +897,14 @@ diff.wordRegex::
 	sequences that match the regular expression are "words", all other
 	characters are *ignorable* whitespace.
 
+diff.wordRegexCheck::
+	Perform a simple sanity check on matches of the word regex.
+	Currently this check ensures that the word regex matches all
+	non-space characters, and that the word regex does not match a
+	newline.  The setting controls what to do when the check
+	fails: 'false'/'off'/'ignore' ignore, 'true'/'on'/'warn' emit
+	a warning, and 'error' abort with an error message.
+
 fetch.recurseSubmodules::
 	A boolean value which changes the behavior for fetch and pull, the
 	default is to not recursively fetch populated sumodules unless
diff --git a/diff.c b/diff.c
index 5fdcb15..7213b2b 100644
--- a/diff.c
+++ b/diff.c
@@ -22,11 +22,17 @@
 #define FAST_WORKING_DIRECTORY 1
 #endif
 ...
From: Junio C Hamano
Date: Saturday, December 18, 2010 - 2:00 pm

How expensive to run this check twice, every time word_regex finds a
match?

As this is about making sure that we got a sane regex from the user (or a
builtin pattern), I wonder if we can make it not depend on the payload we
are matching the regex against.  Then before using a word_regex that we
have not checked, we check if that regex is sane, mark it checked, and do
not have to do the check over and over again.
--

From: Thomas Rast
Date: Saturday, December 18, 2010 - 6:59 pm

It runs the first bullet point for every non-match, and the second
bullet point for every match.  So it looks at every input character

Algorithmically it should be easy once you have the finite state
automaton corresponding to the regex: just verify that for every
possible non-terminal state, there is a transition for every
!isspace() character to a state other than "fail to match" or "match
the empty string".

In the implementation, it might be doable if we switch to compat/regex
on all platforms, since we then have ready access to all internal
structures regcomp() creates, including the DFA.

I'll think about at least using compat/regex for a static check of all
*builtin* patterns, which would be superior to the brute force
approach in 4/4.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--

From: Thomas Rast
Date: Saturday, December 18, 2010 - 9:17 am

We need the word_regex_check member.  Instead of adding another
argument, just pass in the whole struct for future extensibility.

Signed-off-by: Thomas Rast <trast@student.ethz.ch>
---
 diff.c |   11 ++++++-----
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/diff.c b/diff.c
index 0a43869..5fdcb15 100644
--- a/diff.c
+++ b/diff.c
@@ -778,12 +778,13 @@ static void fn_out_diff_words_aux(void *priv, char *line, unsigned long len)
 }
 
 /* This function starts looking at *begin, and returns 0 iff a word was found. */
-static int find_word_boundaries(mmfile_t *buffer, regex_t *word_regex,
+static int find_word_boundaries(mmfile_t *buffer, struct diff_words_data *diff_words,
 		int *begin, int *end)
 {
-	if (word_regex && *begin < buffer->size) {
+	if (diff_words->word_regex && *begin < buffer->size) {
 		regmatch_t match[1];
-		if (!regexec(word_regex, buffer->ptr + *begin, 1, match, 0)) {
+		if (!regexec(diff_words->word_regex, buffer->ptr + *begin,
+			     1, match, 0)) {
 			char *p = memchr(buffer->ptr + *begin + match[0].rm_so,
 					'\n', match[0].rm_eo - match[0].rm_so);
 			*end = p ? p - buffer->ptr : match[0].rm_eo + *begin;
@@ -813,7 +814,7 @@ static int find_word_boundaries(mmfile_t *buffer, regex_t *word_regex,
  * in buffer->orig.
  */
 static void diff_words_fill(struct diff_words_buffer *buffer, mmfile_t *out,
-		regex_t *word_regex)
+		struct diff_words_data *diff_words)
 {
 	int i, j;
 	long alloc = 0;
@@ -827,7 +828,7 @@ static void diff_words_fill(struct diff_words_buffer *buffer, mmfile_t *out,
 	buffer->orig_nr = 1;
 
 	for (i = 0; i < buffer->text.size; i++) {
-		if (find_word_boundaries(&buffer->text, word_regex, &i, &j))
+		if (find_word_boundaries(&buffer->text, diff_words, &i, &j))
 			return;
 
 		/* store original boundaries */
-- 
1.7.3.4.789.g74ad1

--

From: Thomas Rast
Date: Saturday, December 18, 2010 - 9:17 am

Both had an unclosed ] that ruined the safeguard against not matching
a non-space char.

Signed-off-by: Thomas Rast <trast@student.ethz.ch>
---
 userdiff.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/userdiff.c b/userdiff.c
index f9e05b5..2d54536 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -74,14 +74,14 @@
 	 "[a-zA-Z_][a-zA-Z0-9_]*"
 	 "|[-+0-9.e]+[jJlL]?|0[xX]?[0-9a-fA-F]+[lL]?"
 	 "|[-+*/<>%&^|=!]=|//=?|<<=?|>>=?|\\*\\*=?"
-	 "|[^[:space:]|[\x80-\xff]+"),
+	 "|[^[:space:]]|[\x80-\xff]+"),
 	 /* -- */
 PATTERNS("ruby", "^[ \t]*((class|module|def)[ \t].*)$",
 	 /* -- */
 	 "(@|@@|\\$)?[a-zA-Z_][a-zA-Z0-9_]*"
 	 "|[-+0-9.e]+|0[xXbB]?[0-9a-fA-F]+|\\?(\\\\C-)?(\\\\M-)?."
 	 "|//=?|[-+*/<>%&^|=!]=|<<=?|>>=?|===|\\.{1,3}|::|[!=]~"
-	 "|[^[:space:]|[\x80-\xff]+"),
+	 "|[^[:space:]]|[\x80-\xff]+"),
 PATTERNS("bibtex", "(@[a-zA-Z]{1,}[ \t]*\\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
 	 "[={}\"]|[^={}\" \t]+"),
 PATTERNS("tex", "^(\\\\((sub)*section|chapter|part)\\*{0,1}\\{.*)$",
-- 
1.7.3.4.789.g74ad1

--

From: Junio C Hamano
Date: Saturday, December 18, 2010 - 2:02 pm

Thanks.

Couldn't we have found this without your "sanity check" patch?  Are we
ignoring error returns from regcomp() in some codepath, or is it just that
we are catching them but our test suite lacks ruby and python test
vectors?
--

From: Thomas Rast
Date: Saturday, December 18, 2010 - 7:10 pm

We lacked test vectors, but we still couldn't have caught it.  We do
check for errors in regcomp():

	if (o->word_regex) {
		ecbdata.diff_words->word_regex = (regex_t *)
			xmalloc(sizeof(regex_t));
		if (regcomp(ecbdata.diff_words->word_regex,
				o->word_regex,
				REG_EXTENDED | REG_NEWLINE))
			die ("Invalid regular expression: %s",
					o->word_regex);
	}

(Now that I'm seeing this and comparing with regcomp(3), we should
actually report regerror() as part of the error message.)

The problem is that the pattern is still valid.  Consider that it was
a final two arms to the regex:

-        "|[^[:space:]|[\x80-\xff]+"),
+        "|[^[:space:]]|[\x80-\xff]+"),

In the preimage, it parses like so:

  | [^
      [:space:]|[\x80-\xff
     ]+

That is, the third [ is part of the (negated) character class.  So the
only problem is with | or [ characters in the input.  Any other
non-space character is part of the class.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--

From: Thomas Rast
Date: Saturday, December 18, 2010 - 9:17 am

The builtin word regexes should be tested with some simple examples
against simple issues, like failing to match a non-space character.
Do this in bulk.

Mainly due to a lack of language knowledge and inspiration, most of
the test cases (cpp, csharp, java, objc, pascal, php, python, ruby)
are directly based off a C operator precedence table to verify that
all operators are split correctly.  This means that they are probably
incomplete or inaccurate except for 'cpp' itself.

Still, they are good enough to already have uncovered a typo in the
python and ruby patterns.

'fortran' is based on my anecdotal knowledge of the DO10I parsing
rules, and thus probably useless.  The rest (bibtex, html, tex) are an
ad-hoc test of what I consider important splits in those languages.

Signed-off-by: Thomas Rast <trast@student.ethz.ch>
---
 t/t4034-diff-words.sh  |   20 ++++++++++++++++++++
 t/t4034/bibtex/expect  |   15 +++++++++++++++
 t/t4034/bibtex/post    |   10 ++++++++++
 t/t4034/bibtex/pre     |    9 +++++++++
 t/t4034/cpp/expect     |   36 ++++++++++++++++++++++++++++++++++++
 t/t4034/cpp/post       |   19 +++++++++++++++++++
 t/t4034/cpp/pre        |   19 +++++++++++++++++++
 t/t4034/csharp/expect  |   35 +++++++++++++++++++++++++++++++++++
 t/t4034/csharp/post    |   18 ++++++++++++++++++
 t/t4034/csharp/pre     |   18 ++++++++++++++++++
 t/t4034/fortran/expect |   10 ++++++++++
 t/t4034/fortran/post   |    5 +++++
 t/t4034/fortran/pre    |    5 +++++
 t/t4034/html/expect    |    8 ++++++++
 t/t4034/html/post      |    3 +++
 t/t4034/html/pre       |    3 +++
 t/t4034/java/expect    |   36 ++++++++++++++++++++++++++++++++++++
 t/t4034/java/post      |   19 +++++++++++++++++++
 t/t4034/java/pre       |   19 +++++++++++++++++++
 t/t4034/objc/expect    |   35 +++++++++++++++++++++++++++++++++++
 t/t4034/objc/post      |   18 ++++++++++++++++++
 t/t4034/objc/pre       |   18 ++++++++++++++++++
 t/t4034/pascal/expect  |   35 +++++++++++++++++++++++++++++++++++
 ...
From: Thomas Rast
Date: Saturday, December 18, 2010 - 9:24 am

BTW, Junio, you could move the third patch to the front and merge it
to maint.  I think it's an obvious fix to a real bug, and it does not
depend on the other patches except that the test in 4/4 will fail
without the fix.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--

From: Junio C Hamano
Date: Saturday, December 18, 2010 - 1:48 pm

Makes sense; thanks.

--

Previous thread: [PATCH v6.1 0/8] git log -L, cleaned up and (hopefully) fixed by Thomas Rast on Tuesday, December 14, 2010 - 3:54 pm. (10 messages)

Next thread: When I merge, a binary file conflict,how can I select between 2 versions? by Chunlin Zhang on Wednesday, December 15, 2010 - 1:00 am. (3 messages)