Re: what's a useful definition of full text index on a repository?

To: David Tweed <david.tweed@...>
Cc: Git Mailing List <git@...>
Date: Monday, October 1, 2007 - 1:25 pm

On 10/1/07, David Tweed <david.tweed@gmail.com> wrote:

This is what full text is used for with code:
http://lxr.mozilla.org/

It makes grep instant.
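
A rough sketch of why the lookup becomes instant, assuming the working tree is held in a plain dict of path to text (the file names and the `build_index` and `lookup` helpers are made up for illustration, not what LXR actually does): the index is built once, and each search is then a dictionary lookup instead of a scan over every file.

```python
import re
from collections import defaultdict

def build_index(files):
    """Map each identifier-like word to the (path, line number) pairs containing it."""
    index = defaultdict(set)
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for word in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line):
                index[word].add((path, lineno))
    return index

def lookup(index, word):
    """Return sorted hits for a word; no file is re-read at query time."""
    return sorted(index.get(word, ()))

# Hypothetical two-file tree standing in for a checkout.
files = {
    "a.c": "int main(void)\n{\n    pr_debug(\"hi\");\n}\n",
    "b.c": "static void helper(void)\n{\n    pr_debug(\"x\");\n}\n",
}
index = build_index(files)
hits = lookup(index, "pr_debug")
```

The cost moves from query time to index-build time, which is why the index has to be kept in sync with the tree.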

For source code you can take the full text concept further and store
parse trees. This lets you instantly find the callers of a function,
or all users of a variable.
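
To illustrate the parse tree idea, here is a minimal sketch that uses Python's own `ast` module as a stand-in for a C parser (an assumption made only to keep the example self-contained; the `index_callers` helper and the sample source are hypothetical). It records, for every called name, which functions call it.

```python
import ast
from collections import defaultdict

def index_callers(source):
    """Map each called function name to the set of functions that call it."""
    callers = defaultdict(set)
    tree = ast.parse(source)
    for func in ast.walk(tree):
        if not isinstance(func, ast.FunctionDef):
            continue
        # Walk the body of this function and collect plain-name calls.
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                callers[node.func.id].add(func.name)
    return callers

SOURCE = '''
def log(msg):
    print(msg)

def open_device():
    log("open")

def close_device():
    log("close")
'''

callers = index_callers(SOURCE)
```

Once this table is stored, "who calls log?" is a single lookup. A real tool would also have to resolve scopes, macros, and calls through pointers, which this sketch ignores.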

Once you have parse trees in the database you can offer refactoring
too. I have used a powerful proprietary system that used parse trees to
make complicated refactoring quite easy.

Note that a parse tree database doesn't have to be generated for all
the old revisions; it is mainly useful for the current HEAD. Same
for the full text index. When you generate this data you end up with
lots of tiny files that need to be kept in sync with the current HEAD.
git is good for holding those files.

You want all this analysis coordinated with git so that when you
commit a change the right parts get regenerated. Linux seems to be
missing good, automated refactoring assistance. Instead the kernel
janitorial work is being done manually.

An example I noticed today: use of pr_debug() is chaotic in the kernel
source. Many drivers define their own versions of pr_debug() under
#if DEBUG. Fixing everything to use pr_debug() consistently is a
refactoring that could probably be mostly automated.

Mozilla is undergoing some massive automated rewrites.
http://blog.mozilla.com/tglek/category/decomtamination/

Full text indexing can also achieve high levels of compression as
stated in the earlier threads. It is full scale dictionary
compression. When it is being used for compression you want to apply
it to all revisions.
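
A toy sketch of the dictionary idea, assuming each revision is available as expanded text (the `encode` and `decode` helpers and the sample revisions are hypothetical, and this is not how git packs objects): every distinct token is stored once in a shared dictionary and each revision becomes a list of small integers.

```python
import re

def encode(text, dictionary, tokens):
    """Turn text into token ids, growing the shared dictionary as needed."""
    ids = []
    # Alternating word and non-word runs cover every character, so decoding is lossless.
    for tok in re.findall(r"\w+|\W+", text):
        if tok not in dictionary:
            dictionary[tok] = len(tokens)
            tokens.append(tok)
        ids.append(dictionary[tok])
    return ids

def decode(ids, tokens):
    """Rebuild the original text from token ids."""
    return "".join(tokens[i] for i in ids)

dictionary, tokens = {}, []
rev1 = "int count = 0;\nint total = 0;\n"
rev2 = "int count = 0;\nint total = count;\n"
ids1 = encode(rev1, dictionary, tokens)
ids2 = encode(rev2, dictionary, tokens)
```

The second revision adds almost nothing to the dictionary, which is where the savings across many revisions would come from.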


You would full text index the expanded source text for each revision,
not the delta. There are forms of full text indexes that record each
word's position in the document. They let you search for "vision NEAR
feedback".
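
A minimal sketch of such a positional index, assuming documents are short strings keyed by a made-up revision id and that NEAR means "within a given number of words" (the `near` helper and its `distance` default are my own choices for illustration):

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> doc id -> list of word positions in that doc."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word][doc_id].append(pos)
    return index

def near(index, a, b, distance=5):
    """Return doc ids where a and b occur within `distance` words of each other."""
    docs_a = index.get(a, {})
    docs_b = index.get(b, {})
    hits = []
    for doc_id in docs_a.keys() & docs_b.keys():
        if any(abs(pa - pb) <= distance
               for pa in docs_a[doc_id] for pb in docs_b[doc_id]):
            hits.append(doc_id)
    return sorted(hits)

docs = {
    "r1": "the vision system uses feedback from the camera",
    "r2": "vision is discussed here but the word we want appears only "
          "much later after many other words feedback",
}
index = build_positional_index(docs)
close_hits = near(index, "vision", "feedback")
loose_hits = near(index, "vision", "feedback", distance=20)
```

Both documents contain both words, but only the stored positions let the query tell them apart.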

A good feature of this is finding when a variable or function was first added.
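
As a sketch of that lookup, assuming the revisions are fed in oldest first as (id, expanded text) pairs (the ids and the `first_seen` helper are hypothetical), the index only needs to remember the first revision in which each identifier shows up:

```python
import re

def first_seen(revisions):
    """revisions: list of (rev_id, full_text), oldest first.

    Returns a dict mapping each identifier to the first revision containing it.
    """
    first = {}
    for rev_id, text in revisions:
        for word in set(re.findall(r"[A-Za-z_]\w*", text)):
            # setdefault keeps the earliest revision and ignores later ones.
            first.setdefault(word, rev_id)
    return first

revisions = [
    ("rev1", "int a;"),
    ("rev2", "int a; int b_count;"),
    ("rev3", "int a; int b_count; void reset(void);"),
]
first = first_seen(revisions)
```

With this table built, finding where `b_count` was introduced is a single lookup rather than a walk over history.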


I do admit that these indexes only speed up operations that can
already be done by brute force. As computers get faster the need for
them decreases. Right now the size of the kernel repo is not growing
faster than the progress of hardware. If you went back and tried to do
these things on a 386 you'd be shouting for indexes tomorrow.



-- 
Jon Smirl
jonsmirl@gmail.com