Right, in the demo I make an extra pass after the inverted indexing
step to prune the index -- which means eliminating the common lines
*entirely* from the index (so they don't get attributed to a random
file) *and* decrementing all the file sizes by 1. That way the
similarity scores shouldn't get skewed.

And as you mentioned, we could bump the threshold from 1 to some other
small integer. Intuitively, it is common to copy a file to 2 or 3
places, and you don't want all of its lines thrown out because of
that. But you usually don't copy a file to 10 or 50 places.
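
To make the pruning pass concrete, here is a rough Python sketch. It is
not the demo code; the names build_index and prune are made up for
illustration. I'm assuming the index maps each line to the set of files
containing it, and that a file's "size" is its count of distinct lines,
so each pruned line costs every file that held it exactly 1.

```python
from collections import defaultdict

def build_index(files):
    """Inverted index: map each line to the set of file names containing it."""
    index = defaultdict(set)
    for name, lines in files.items():
        for line in lines:
            index[line].add(name)
    return index

def prune(index, sizes, threshold=1):
    """Drop every line that occurs in more than `threshold` files.

    Returns the pruned index and a copy of `sizes` in which each file
    that held a dropped line has had its size decremented by 1.
    """
    pruned = {}
    new_sizes = dict(sizes)
    for line, names in index.items():
        if len(names) > threshold:
            # Common line: remove it entirely, shrink every file that had it.
            for name in names:
                new_sizes[name] -= 1
        else:
            pruned[line] = names
    return pruned, new_sizes

# Hypothetical input: three tiny "files" given as lists of lines.
files = {
    "a": ["x", "common"],
    "b": ["y", "common"],
    "c": ["z", "common", "x"],
}
index = build_index(files)
sizes = {name: len(set(lines)) for name, lines in files.items()}

strict_index, strict_sizes = prune(index, sizes, threshold=1)
loose_index, loose_sizes = prune(index, sizes, threshold=2)
```

With threshold=1 both "common" and "x" are dropped; with threshold=2
only "common" is, so the line shared by just two files still counts
toward their similarity.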
Andy