Hi, I have large-ish repository (143 MB) where git rebase failed unexpectedly (see below). If anyone is interested in looking at this, I have left a copy of the repository at http://thorn.ws/git-fatal-rebase-error.tar.gz Tommy PS: Please add me in CC as I'm no longer on this list. $ git --version git version 1.5.5.1.93.ge3f3e $ git rebase cacao my-cacao-build First, rewinding head to replay your work on top of it... Applying Import binutils-2.18 .dotest/patch:16878: trailing whitespace. .dotest/patch:17923: trailing whitespace. 0. Additional Definitions. .dotest/patch:18024: trailing whitespace. Version. .dotest/patch:18094: trailing whitespace. // .dotest/patch:18099: trailing whitespace. // fatal: corrupt patch at line 260912 Patch failed at 0001. When you have resolved this problem run "git rebase --continue". If you would prefer to skip this patch, instead run "git rebase --skip". To restore the original branch and stop rebasing run "git rebase --abort". --
Hi, I can reproduce. It seems that our diff machinery generates a file that adds a file with 10668 lines, but says 10669 lines in the hunk header. It is a .info file, so I suspect strange things going on with non-printable ASCII characters. I'm on it, Dscho --
Hi, And indeed, this is the culprit: -- snip -- BFD Index ********* ^@^H[index^@^H] * Menu: -- snap -- Note the NUL characters? Now, the thing is: git format-patch still outputs it correctly. But git-rebase pipes the output to git-am, which in turn calls git-mailsplit, which suppresses that line. Will keep you posted, Dscho --
The function fgets() has a big problem with NUL characters: it reads
them, but nobody will know if the NUL comes from the file stream, or
was appended at the end of the line.
So implement a custom read_line() function.
Noticed by Tommy Thorn.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
Sorry for the binary patch: the file t5100/nul contains NUL
characters, obviously.
BTW I do not know how much fgetc() instead of fgets() slows
down things, but I expect both to be equally fast because
they are both buffered, right?
builtin-mailinfo.c | 24 +++++++++++++-----------
builtin-mailsplit.c | 27 +++++++++++++++++++++++----
builtin.h | 1 +
t/t5100-mailinfo.sh | 9 +++++++++
t/t5100/nul | Bin 0 -> 91 bytes
5 files changed, 46 insertions(+), 15 deletions(-)
create mode 100644 t/t5100/nul
diff --git a/builtin-mailinfo.c b/builtin-mailinfo.c
index 11f154b..f0c4209 100644
--- a/builtin-mailinfo.c
+++ b/builtin-mailinfo.c
@@ -641,7 +641,7 @@ static void decode_transfer_encoding(char *line, unsigned linesize)
}
}
-static int handle_filter(char *line, unsigned linesize);
+static int handle_filter(char *line, unsigned linesize, int linelen);
static int find_boundary(void)
{
@@ -669,7 +669,7 @@ again:
"can't recover\n");
exit(1);
}
- handle_filter(newline, sizeof(newline));
+ handle_filter(newline, sizeof(newline), strlen(newline));
/* skip to the next boundary */
if (!find_boundary())
@@ -759,14 +759,14 @@ static int handle_commit_msg(char *line, unsigned linesize)
return 0;
}
-static int handle_patch(char *line)
+static int handle_patch(char *line, int len)
{
- fputs(line, patchfile);
+ fwrite(line, 1, len, patchfile);
patch_lines++;
return 0;
}
-static int handle_filter(char *line, unsigned linesize)
+static int handle_filter(char *line, unsigned linesize, int linelen)
{
static int filter = 0;
@@ -779,7 +779,7 @@ static int handle_filt...Looking at what handle_body() does for TE_BASE64 and TE_QP cases, I have to wonder if this is enough. The loop seems to stop at (*op == NUL) which follows an old assumption that each line is terminated with NUL, not the new assumption you introduced that each line's length is kept in local variable len. --
Hi, Of course! But does BASE64 and QP contain NULs? After all, even my custom read_line_with_nul() function adds a NUL IIRC. Well, the biggest problem here is my lack of time. I thought I would give Tommy a patch which kinda works, and he would actually hold through to brush it up until it shines and gets into git.git, because it is not _my_ itch. Hmmmmmmm. Ciao, Dscho --
The loop in question iterates over bytes _after_ decoding these encoded lines, and a typical reason you would encode the payload is because it contains something not safe over e-mail transfer, e.g. NUL. I think decode_transfer_encoding() also needs to become safe against NULs in the payload. --
Hi, Okay, I missed that. Ciao, Dscho --
In my experience, fgetc() is pretty fantastically slow because you have a function call for every byte (and, I gather, modern libc does thread locking for every fgetc). It's usually much faster to fread() into a buffer and then access the buffer. I don't know if that's appropriate (or matters) here, though. Have fun, Avery --
Hi, Hmpf. I hoped to get more definitive information here. Especially given that fgetc() is nothing more than a glorified fread() into a buffer, and then access the buffer. Well, at least you kind of pointed me to the _unlocked() function family. Ciao, Dscho --
Point taken.
/tmp $ for d in test1 test2 test3 test3u; do echo -n "$d: ";
/usr/bin/time ./$d </dev/zero; done
test1: 0.09user 0.05system 0:00.14elapsed 94%CPU
test2: 2.50user 0.05system 0:02.54elapsed 100%CPU
test3: 2.48user 0.06system 0:02.53elapsed 100%CPU
test3u: 1.05user 0.05system 0:01.10elapsed 99%CPU
fread is about 18x faster than fgetc(). getc() is the same speed as
fgetc(). getc_unlocked() is definitely faster than getc, but still at
least 7x slower than fread().
And if you think *that* sucks, you should try "c << cin" in C++ :)
Source code below.
Have fun,
Avery
=== test1.c ===
#include <stdio.h>
int main()
{
char buf[1024];
int i;
for (i = 0; i < 102400; i++)
fread(buf, 1, sizeof(buf), stdin);
}
=== test2.c ===
#include <stdio.h>
int main()
{
int i;
for (i = 0; i < 1024*102400; i++)
fgetc(stdin);
}
=== test3.c ===
#include <stdio.h>
int main()
{
int i;
for (i = 0; i < 1024*102400; i++)
getc(stdin);
}
=== test3u.c ===
#include <stdio.h>
int main()
{
int i;
for (i = 0; i < 1024*102400; i++)
getc_unlocked(stdin);
}
--Hi,
Well, my question was more about fgetc() vs fgets().
If you feel like it, you might benchmark this patch:
-- snipsnap --
diff --git a/builtin-mailsplit.c b/builtin-mailsplit.c
index 021dc16..5d8defd 100644
--- a/builtin-mailsplit.c
+++ b/builtin-mailsplit.c
@@ -45,13 +45,32 @@ static int is_from_line(const char *line, int len)
/* Could be as small as 64, enough to hold a Unix "From " line. */
static char buf[4096];
+/*
+ * This is an ugly hack to avoid fgetc(), which is slow, as it is locking.
+ * The argument "in" must be the same for all calls to this function!
+ */
+static int fast_fgetc(FILE *in)
+{
+ static char buf[4096];
+ static int offset = 0, len = 0;
+
+ if (offset >= len) {
+ len = fread(buf, 1, sizeof(buf), in);
+ offset = 0;
+ if (!len && feof(in))
+ return EOF;
+ }
+
+ return buf[offset++];
+}
+
/* We cannot use fgets() because our lines can contain NULs */
int read_line_with_nul(char *buf, int size, FILE *in)
{
int len = 0, c;
for (;;) {
- c = fgetc(in);
+ c = fast_fgetc(in);
buf[len++] = c;
if (c == EOF || c == '\n' || len + 1 >= size)
break;
--Wouldn't it be better to improve the implementation of getc() in glibc instead? getc() is meant to be the fast version of fgetc(), and if it isn't (anymore), then the library needs fixing. -- Sincerely, srb@cuci.nl Stephen R. van den Berg. --
Hi, fgetc() has to work reliably in a threaded environment, too. So I guess that it does all kinds of locks that slow it down, but it is not something to be "fixed". Hth, Dscho --
Looks great to me, but shouldn't you add an "inline" for this one? Also, maybe a double the buffer size. Tommy --
Hi,
No. This is an ugly hack, and not meant for application.
If that is substantially faster than the fgetc() version (and I want this
be tested in a _real-world_ scenario, i.e. not the fgetc() alone, but a
real mailsplit and a real mailinfo on a huge patch, with all three
versions: fgets(), fgetc() and fast_fgetc())), then I would prefer having
something like
struct line_reader {
FILE *in;
char buffer[4096];
int offset, int len;
char line[1024];
int linelen;
};
and corresponding functions to read lines in that setting. Maybe it would
even be better to have line be a strbuf, but I am not so sure on that.
Let's see what the tests show. Would you do them, please?
"git format-patch --stdout bla..blub | /usr/bin/time git mailsplit -o."
three times in succession should give you a good hint on the runtime.
Ciao,
Dscho
--
| Ingo Molnar | Re: [BUG] long freezes on thinkpad t60 |
| Rafael J. Wysocki | Re: [Bug 10030] Suspend doesn't work when SD card is inserted |
| Jamie Lokier | Proposal for "proper" durable fsync() and fdatasync() |
| jimmy bahuleyan | Re: how about mutual compatibility between Linux's GPLv2 and GPLv3? |
git: | |
| Martin Langhoff | Handling large files with GIT |
| Matt Mackall | Re: cleaner/better zlib sources? |
| Wink Saville | git-svn segmetation fault |
| Bill Lear | Meaning of "fatal: protocol error: bad line length character"? |
| Florin Andrei | firewall is very slow, something's wrong |
| Wijnand Wiersma | Almost success: OpenBSD on Xen |
| Marcus Andree | Re: OpenBSD kernel janitors |
| Richard Stallman | Real men don't attack straw men |
| David Miller | Re: tcp bw in 2.6 |
| Rick Jones | Re: 2.6.24 BUG: soft lockup - CPU#X |
| Patrick McHardy | [NET_SCHED 00/04]: External SFQ classifiers/flow classifier |
| Patrick McHardy | Re: [PATCH 2/2] [e1000 VLAN] Disable vlan hw accel when promiscuous mode |
