I ask help to the list because my knowledge on this is not enough.
Currently qgit uses, socket based, QProcess class to read data from
'git rev-list' when loading the repository at startup.The time it takes to read, without processing, the whole Linux tree
with this approach it's almost _double_ of the time it takes 'git
rev-list' to write to a file:$git rev-list --header --boundary --parents --topo-order HEAD >> tmp.txt
We are talking of about 7s against less then 4s, on my box (warm cache).
So I have a patch to make 'git rev-list' writing into a temporary file
and then read it in memory, perhaps it's not the cleaner way, but it's
faster, about 1s less.I have browsed Qt sources and found that QProcess uses internal
buffers that are then copied again before to be used by the
application. File approach uses a call to read() /fread() buired
inside the Qt's QFile class, and no intermediate buffers, so perhaps
this could be the reason the second way it's faster.Anyway there are some issues:
1) File tmp.txt is deleted as soon as read, but this is not enough
sometimes to avoid a costly and wasteful write access to disk by the
OS. What is the easiest, portable way to create a temporary 'in memory
only' file, with no disk access? Or at least delay the HD write access
enough to be able to read and delete the file before the fist block of
tmp.txt is flushed to disk?2) There is a faster/cleaner (and *safe* ) way to access directly 'git
rev-list' output, something like (just as an example):$git rev-list --header --boundary --parents --topo-order HEAD >> /dev/mem
Or something similar, possibly _simple_ and _portable_ , so to be able
to copy the big amount of 'git rev-list' output just once (about 30MB
with current tree).3) Other suggestions? ;-)
Thanks
Marco
-
The revision listing machinery is fairly well isolated behind some
pretty clean APIs in Git. Why not link qgit against libgit.a and
just do the revision listing in process?--
Shawn.
-
Hi,
Because, depending on what you do, the revision machinery is not
reentrable. For example, if you filter by filename, the history is
rewritten in-memory to simulate a history where just that filename was
tracked, and nothing else. These changes are not cleaned up after calling
the internal revision machinery.Hth,
Dscho-
Well, it really wouldn't be that hard to add a new library interface to
"reset object state". We could fairly trivially either:- walk all objects in the object hashes and clear all the flags.
- just clear all objects _and_ the hashes.Yes, it implies a small amount of manual "management", but considering
that the reason it needs to be manual is that the functions simply _need_
the state, is that such a big deal?Linus
-
So the library approach sounds like the best?
Of course in this case the producer git-rev-list and the receiver use
the same address space.In the case of a temporary file data is first copied to OS disk cache
buffers and then again to userspace, in qgit address space. But the
real pain is that the temporary file is always flushed to disk after
4-5 seconds from creation, also if under heavy read/write activity.
This is a problem for big repos. I really don't know how to workaround
this useless disk flush.Finally, what about using some kind of shared memory at run time,
instead of _sharing_ developer libraries ;-) ? is it too messy?Probably the concurrent reading while writing is possible without
syncro if the reader understands that a sequence of _two_ or more \0
it means the end of current write stream if producer is still running
or the end of data if producer is not running anymore. I use a similar
approach in the 'temporary file' patch where receiver is able to read
while producer writes without explicit synchronization. In that case a
read() of a block smaller then maximum with producer still running is
used as the 'break' condition in the receiver while loop.Marco
-
Where can I found some documentation (yes I know RTFS, but...) or,
better, an example of using the API to read git-rev-list output?if it is possible I also would like to avoid to mess with internal git
API's, of course *if it is possible* ;-)Thanks
Marco
-
builtin-rev-list.c. :-)
I think all you may need is:
#include "revision.h"
...
struct rev_info revs;
init_revisions(&revs, prefix);
revs.abbrev = 0;
revs.commit_format = CMIT_FMT_UNSPECIFIED;
argc = setup_revisions(argc, argv, &revs, NULL);where argv just a char** of the arguments you were going to hand
to rev-list on the command line.then get the data back:
static void show_commit(struct commit *commit)
{
const char * hex = sha1_to_hex(commit->object.sha1);
... copy from hex to your own structures ...
}static void show_object(struct object_array_entry *p)
{
/* do nothing */
}prepare_revision_walk(&revs);
traverse_commit_list(&revs, show_commit, show_object);--
Shawn.
-
You'll also need to call:
setup_git_directory();
Although now that I think about it the library may not be enough
of a library. Some data (e.g. commits) will stay in memory forever
once loaded. Pack files won't be released once read; a pack recently
made available while the application is running may not get noticed.Perhaps there is some fast IPC API supported by Qt that you could
use to run the revision listing outside of the main UI process,
to eliminate the bottlenecks you are seeing and remove the problems
noted above? One that doesn't involve reading from a pipe I mean...--
Shawn.
-
Why not just fork() + exec() and read from the filedescriptor? You can
up the output buffer of the forked program to something suitable, which
means the OS will cache it for you until you copy it to a buffer in qgit
(i.e., read from the descriptor).--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-
Please, what do you mean with "something suitable"? How can I redirect
the output to a memory buffer or to a file that the OS will cache
*until* I've copied it?If I redirect to a 'normal' file, this will be flushed by OS after
some time, normally few seconds.Could you please post links with examples/docs about this kind of
implementation?Thanks
Marco
-
There is a very handy solution to this problem called "tmpfs". It
should already be mounted at /tmp. Put tmp.txt there and your problem
will go away.Cheers,
- Michael
-
Thanks Michael,
It seems to work! patch pushed.
Marco
P.S: I've looked again to Shawn idea (and code) of linking qgit
against libgit.a but I found these two difficult points:- traverse_commit_list(&revs, show_commit, show_object) is blocking,
i.e. the GUI will stop responding for few seconds while traversing the
list. This is easily and transparently solved by the OS scheduler if
an external process is used for git-rev-list. To solve this in qgit I
have two ways: 1) call QEventLoop() once in a while from inside
show_commit()/ show_object() to process pending events 2) Use a
separate thread (QThread class). The first idea is not nice, the
second opens a whole a new set of problems and it's a big amount of
not trivial new code to add.- traverse_commit_list() having an internal state it's not
re-entrant. git-rev-list it's used to load main view data but also
file history in another tab, and the two calls _could_ be ran
concurrently. With external process I simply run two instances of
DataLoader class and consequently two external git-rev-list processes,
but If I link against libgit.a that would be a big problem.
-
Hi,
Could somebody remind me why different processes are needed? I thought
that the revision machinery should be used directly, by linking to
libgit.a...Ciao,
Dscho-
You wrote:
--%<--%<--%<--
Because, depending on what you do, the revision machinery is not
reentrable. For example, if you filter by filename, the history is
rewritten in-memory to simulate a history where just that filename was
tracked, and nothing else. These changes are not cleaned up after
calling the internal revision machinery.
--%<--%<--%<--When I wrote the above suggestion, I hadn't read the posts following the
email where I cut this text from (where Linus said "we can add a 'reset'
thingie to the revision walking machinery" and Marco replied with some
more questions).--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-
Hi,
Yes. The reset thingie is already in place: clear_commit_marks(). It would
have to be enhanced a little, though:1) the function rewrite_parents(), should add another flag, HALFORPHANED,
and
2) clear_commit_marks() should unset the "parsed" flag of the commits for
which HALFORPHANED is reset.-- snip --
diff --git a/commit.c b/commit.c
index d5103cd..fd225c8 100644
--- a/commit.c
+++ b/commit.c
@@ -431,6 +431,10 @@ void clear_commit_marks(struct commit *commit, unsigned int mark)
{
struct commit_list *parents;+ /* were parents rewritten? */
+ if ((mark & commit->object.flags) & HALFORPHANED)
+ commit->object.parsed = 0;
+
commit->object.flags &= ~mark;
parents = commit->parents;
while (parents) {
diff --git a/revision.c b/revision.c
index 993bb66..461ee06 100644
--- a/revision.c
+++ b/revision.c
@@ -1097,6 +1097,7 @@ static void rewrite_parents(struct rev_info *revs, struct commit *commit)
struct commit_list *parent = *pp;
if (rewrite_one(revs, &parent->item) < 0) {
*pp = parent->next;
+ commit->object.flags |= HALFORPHANED;
continue;
}
pp = &parent->next;
diff --git a/revision.h b/revision.h
index 3adab95..544238c 100644
--- a/revision.h
+++ b/revision.h
@@ -9,6 +9,7 @@
#define BOUNDARY (1u<<5)
#define BOUNDARY_SHOW (1u<<6)
#define ADDED (1u<<7) /* Parents already parsed and added? */
+#define HALFORPHANED (1u<<8) /* parents were rewritten */struct rev_info;
struct log_info;
-- snap --Note that this is just the idea. This particular implementation opens a
gaping memory leak, since the buffer of the commit is not free()d, and a
reparse would probably not pick up on the fact that the parent commits are
already in memory.Ciao,
Dscho-
Qt it's very fast in reading from files, also git-rev-list is fast in
write to a file...the problem is I would not want the file to be saved
on disk, but stay cached in the OS memory for the few seconds needed
to be written and read back, and then deleted. It's a kind of shared
memory at the end. But I don't know how to realize it.Also let git-rev-list to write directly in qgit process address space
would be nice, indeed very nice.Marco
-
On a modern Linux (probably your largest target audience) a small
file which has a very short lifespan (few seconds) is unlikey to
hit the platter. Most filesystems will put the data into buffer
cache and delay writing to disk because temporary files are so
common on UNIX.And ugly. :-)
SysV IPC (shared memory, semaphores) are messy and difficult to
get right. mmap against a random file in the filesystem tends
to work better on those systems which support it well, provided
that the file isn't on a network mount. But again you still need
semaphores or something like them to control access to the data in
the mmap'd region.I was thinking that maybe if Qt had a bounded buffer available for
use between a process and its child, that you could use that to run
your own "qgit-rev-list" child and get the data back more quickly,
without the need for a temporary file. But it doesn't look like
they have one. Oh well.Your current temporary file approach is probably the best you can
get, and has the simplest possible implementation. Doing better
would require linking against libgit.a, and getting the core Git
hackers to make at least the revision machinery more useful in a
library setting.--
Shawn.
-
| Ingo Molnar | [patch 12/13] syslets: x86: optimized copy_uatom() |
| Greg Kroah-Hartman | [PATCH 017/196] aoechr: Convert from class_device to device |
| Yinghai Lu | Re: 2.6.26, PAT and AMD family 6 |
| Jan Engelhardt | intel iommu (Re: -mm merge plans for 2.6.23) |
git: | |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| David Miller | [GIT]: Networking |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Natalie Protasevich | [BUG] New Kernel Bugs |
