login
Header Space

 
 

Re: [PATCH] mailsplit and mailinfo: gracefully handle NUL characters

Previous thread: Why do git submodules require manual checkouts and commits? by skillzero on Friday, May 16, 2008 - 12:16 am. (9 messages)

Next thread: [PATCH] submodule update: add convenience option --init by Johannes Schindelin on Friday, May 16, 2008 - 6:23 am. (2 messages)
To: <git@...>
Date: Thursday, May 15, 2008 - 3:27 am

Hi,

I have large-ish repository (143 MB) where git rebase failed 
unexpectedly (see below). If anyone is interested in looking at this, I 
have left a copy of the repository at 
http://thorn.ws/git-fatal-rebase-error.tar.gz

Tommy
PS: Please add me in CC as I'm no longer on this list.


$ git --version
git version 1.5.5.1.93.ge3f3e

$ git rebase cacao my-cacao-build
First, rewinding head to replay your work on top of it...
Applying Import binutils-2.18
.dotest/patch:16878: trailing whitespace.
 
.dotest/patch:17923: trailing whitespace.
  0. Additional Definitions.
.dotest/patch:18024: trailing whitespace.
       Version.
.dotest/patch:18094: trailing whitespace.
//
.dotest/patch:18099: trailing whitespace.
//
fatal: corrupt patch at line 260912
Patch failed at 0001.

When you have resolved this problem run "git rebase --continue".
If you would prefer to skip this patch, instead run "git rebase --skip".
To restore the original branch and stop rebasing run "git rebase --abort".


--
To: Tommy Thorn <tommy-git@...>
Cc: <git@...>
Date: Friday, May 16, 2008 - 6:41 am

Hi,


I can reproduce.  It seems that our diff machinery generates a file that 
adds a file with 10668 lines, but says 10669 lines in the hunk header.

It is a .info file, so I suspect strange things going on with 
non-printable ASCII characters.

I'm on it,
Dscho
--
To: Tommy Thorn <tommy-git@...>
Cc: <git@...>
Date: Friday, May 16, 2008 - 7:01 am

Hi,


And indeed, this is the culprit:

-- snip --
BFD Index
*********

^@^H[index^@^H]
* Menu:

-- snap --

Note the NUL characters?

Now, the thing is: git format-patch still outputs it correctly.  But 
git-rebase pipes the output to git-am, which in turn calls git-mailsplit, 
which suppresses that line.

Will keep you posted,
Dscho

--
To: Tommy Thorn <tommy-git@...>
Cc: <git@...>
Date: Friday, May 16, 2008 - 9:03 am

The function fgets() has a big problem with NUL characters: it reads
them, but nobody will know if the NUL comes from the file stream, or
was appended at the end of the line.

So implement a custom read_line() function.

Noticed by Tommy Thorn.

Signed-off-by: Johannes Schindelin &lt;johannes.schindelin@gmx.de&gt;
---

	Sorry for the binary patch: the file t5100/nul contains NUL
	characters, obviously.

	BTW I do not know how much fgetc() instead of fgets() slows
	down things, but I expect both to be equally fast because
	they are both buffered, right?

 builtin-mailinfo.c  |   24 +++++++++++++-----------
 builtin-mailsplit.c |   27 +++++++++++++++++++++++----
 builtin.h           |    1 +
 t/t5100-mailinfo.sh |    9 +++++++++
 t/t5100/nul         |  Bin 0 -&gt; 91 bytes
 5 files changed, 46 insertions(+), 15 deletions(-)
 create mode 100644 t/t5100/nul

diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index 11f154b..f0c4209 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -641,7 +641,7 @@ static void decode_transfer_encoding(char *line, unsigned linesize)
 	}
 }
 
-static int handle_filter(char *line, unsigned linesize);
+static int handle_filter(char *line, unsigned linesize, int linelen);
 
 static int find_boundary(void)
 {
@@ -669,7 +669,7 @@ again:
 					"can't recover\n");
 			exit(1);
 		}
-		handle_filter(newline, sizeof(newline));
+		handle_filter(newline, sizeof(newline), strlen(newline));
 
 		/* skip to the next boundary */
 		if (!find_boundary())
@@ -759,14 +759,14 @@ static int handle_commit_msg(char *line, unsigned linesize)
 	return 0;
 }
 
-static int handle_patch(char *line)
+static int handle_patch(char *line, int len)
 {
-	fputs(line, patchfile);
+	fwrite(line, 1, len, patchfile);
 	patch_lines++;
 	return 0;
 }
 
-static int handle_filter(char *line, unsigned linesize)
+static int handle_filter(char *line, unsigned linesize, int linelen)
 {
 	static int filter = 0;
 
@@ -779,7 +779,7 @@ static int handle_filt...
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Tommy Thorn <tommy-git@...>, <git@...>
Date: Wednesday, May 21, 2008 - 2:08 pm

Looking at what handle_body() does for TE_BASE64 and TE_QP cases, I have
to wonder if this is enough.  The loop seems to stop at (*op == NUL) which
follows an old assumption that each line is terminated with NUL, not the
new assumption you introduced that each line's length is kept in local
variable len.
--
To: Junio C Hamano <junio@...>
Cc: Tommy Thorn <tommy-git@...>, <git@...>
Date: Thursday, May 22, 2008 - 6:38 am

Hi,


Of course!  But does BASE64 and QP contain NULs?  After all, even my 
custom read_line_with_nul() function adds a NUL IIRC.

Well, the biggest problem here is my lack of time.  I thought I would give 
Tommy a patch which kinda works, and he would actually hold through to 
brush it up until it shines and gets into git.git, because it is not _my_ 
itch.

Hmmmmmmm.

Ciao,
Dscho


--
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Tommy Thorn <tommy-git@...>, <git@...>
Date: Thursday, May 22, 2008 - 1:44 pm

The loop in question iterates over bytes _after_ decoding these encoded
lines, and a typical reason you would encode the payload is because it
contains something not safe over e-mail transfer, e.g. NUL.

I think decode_transfer_encoding() also needs to become safe against NULs
in the payload.

--
To: Junio C Hamano <gitster@...>
Cc: Tommy Thorn <tommy-git@...>, <git@...>
Date: Friday, May 23, 2008 - 7:21 am

Hi,


Okay, I missed that.

Ciao,
Dscho

--
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Tommy Thorn <tommy-git@...>, <git@...>
Date: Friday, May 16, 2008 - 10:03 am

In my experience, fgetc() is pretty fantastically slow because you
have a function call for every byte (and, I gather, modern libc does
thread locking for every fgetc).  It's usually much faster to fread()
into a buffer and then access the buffer.  I don't know if that's
appropriate (or matters) here, though.

Have fun,

Avery
--
To: Avery Pennarun <apenwarr@...>
Cc: Tommy Thorn <tommy-git@...>, <git@...>
Date: Friday, May 16, 2008 - 10:32 am

Hi,


Hmpf.  I hoped to get more definitive information here.  Especially given 
that fgetc() is nothing more than a glorified fread() into a buffer, and 
then access the buffer.

Well, at least you kind of pointed me to the _unlocked() function family.

Ciao,
Dscho

--
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Tommy Thorn <tommy-git@...>, <git@...>
Date: Friday, May 16, 2008 - 10:56 am

Point taken.

/tmp $ for d in test1 test2 test3 test3u; do echo -n "$d: ";
/usr/bin/time ./$d &lt;/dev/zero; done
test1: 0.09user 0.05system 0:00.14elapsed 94%CPU
test2: 2.50user 0.05system 0:02.54elapsed 100%CPU
test3: 2.48user 0.06system 0:02.53elapsed 100%CPU
test3u: 1.05user 0.05system 0:01.10elapsed 99%CPU

fread is about 18x faster than fgetc().  getc() is the same speed as
fgetc().  getc_unlocked() is definitely faster than getc, but still at
least 7x slower than fread().

And if you think *that* sucks, you should try "c &lt;&lt; cin" in C++ :)

Source code below.

Have fun,

Avery


=== test1.c ===
#include &lt;stdio.h&gt;

int main()
{
    char buf[1024];
    int i;
    for (i = 0; i &lt; 102400; i++)
        fread(buf, 1, sizeof(buf), stdin);
}

=== test2.c ===
#include &lt;stdio.h&gt;

int main()
{
    int i;
    for (i = 0; i &lt; 1024*102400; i++)
        fgetc(stdin);
}

=== test3.c ===
#include &lt;stdio.h&gt;

int main()
{
    int i;
    for (i = 0; i &lt; 1024*102400; i++)
        getc(stdin);
}

=== test3u.c ===
#include &lt;stdio.h&gt;

int main()
{
    int i;
    for (i = 0; i &lt; 1024*102400; i++)
        getc_unlocked(stdin);
}
--
To: Avery Pennarun <apenwarr@...>
Cc: Tommy Thorn <tommy-git@...>, <git@...>
Date: Friday, May 16, 2008 - 7:59 pm

Hi,


Well, my question was more about fgetc() vs fgets().

If you feel like it, you might benchmark this patch:

-- snipsnap --
diff --git a/builtin-mailsplit.c b/builtin-mailsplit.c
index 021dc16..5d8defd 100644
--- a/builtin-mailsplit.c
+++ b/builtin-mailsplit.c
@@ -45,13 +45,32 @@ static int is_from_line(const char *line, int len)
 /* Could be as small as 64, enough to hold a Unix "From " line. */
 static char buf[4096];
 
+/*
+ *  This is an ugly hack to avoid fgetc(), which is slow, as it is locking.
+ *  The argument "in" must be the same for all calls to this function!
+ */
+static int fast_fgetc(FILE *in)
+{
+	static char buf[4096];
+	static int offset = 0, len = 0;
+
+	if (offset &gt;= len) {
+		len = fread(buf, 1, sizeof(buf), in);
+		offset = 0;
+		if (!len &amp;&amp; feof(in))
+			return EOF;
+	}
+
+	return buf[offset++];
+}
+
 /* We cannot use fgets() because our lines can contain NULs */
 int read_line_with_nul(char *buf, int size, FILE *in)
 {
 	int len = 0, c;
 
 	for (;;) {
-		c = fgetc(in);
+		c = fast_fgetc(in);
 		buf[len++] = c;
 		if (c == EOF || c == '\n' || len + 1 &gt;= size)
 			break;
--
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Avery Pennarun <apenwarr@...>, Tommy Thorn <tommy-git@...>, <git@...>
Date: Saturday, May 17, 2008 - 6:07 am

Wouldn't it be better to improve the implementation of getc()
in glibc instead?

getc() is meant to be the fast version of fgetc(), and if it isn't
(anymore), then the library needs fixing.
-- 
Sincerely,                                                          srb@cuci.nl
           Stephen R. van den Berg.
--
To: Stephen R. van den Berg <srb@...>
Cc: Avery Pennarun <apenwarr@...>, Tommy Thorn <tommy-git@...>, <git@...>
Date: Saturday, May 17, 2008 - 6:18 am

Hi,



fgetc() has to work reliably in a threaded environment, too.  So I guess 
that it does all kinds of locks that slow it down, but it is not something 
to be "fixed".

Hth,
Dscho

--
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Avery Pennarun <apenwarr@...>, <git@...>
Date: Friday, May 16, 2008 - 8:06 pm

Looks great to me, but shouldn't you add an "inline" for this one? Also, 
maybe a double the buffer size.

Tommy

--
To: Tommy Thorn <tommy-git@...>
Cc: Avery Pennarun <apenwarr@...>, <git@...>
Date: Friday, May 16, 2008 - 8:26 pm

Hi,


No.  This is an ugly hack, and not meant for application.

If that is substantially faster than the fgetc() version (and I want this 
be tested in a _real-world_ scenario, i.e. not the fgetc() alone, but a 
real mailsplit and a real mailinfo on a huge patch, with all three 
versions: fgets(), fgetc() and fast_fgetc())), then I would prefer having 
something like

	struct line_reader {
		FILE *in;
		char buffer[4096];
		int offset, int len;
		char line[1024];
		int linelen;
	};

and corresponding functions to read lines in that setting.  Maybe it would 
even be better to have line be a strbuf, but I am not so sure on that.

Let's see what the tests show.  Would you do them, please?  
"git format-patch --stdout bla..blub | /usr/bin/time git mailsplit -o." 
three times in succession should give you a good hint on the runtime.

Ciao,
Dscho


--
Previous thread: Why do git submodules require manual checkouts and commits? by skillzero on Friday, May 16, 2008 - 12:16 am. (9 messages)

Next thread: [PATCH] submodule update: add convenience option --init by Johannes Schindelin on Friday, May 16, 2008 - 6:23 am. (2 messages)
speck-geostationary