The '6' used as a cut-off for the Levenshtein distances we list when
autocorrecting was derived in a totally scientific manner:
1) first I implemented Damerau-Levenshtein with configurable weights for
neighbor flips ("switches"), substitutions, additions and deletions,
2) next I patched the code to sort the available commands by their distance
to the misspelt command,
3) as this lists way too much, I implemented a cut-off that was
configurable by an environment variable (without any safety checks, as
I did not plan to release that code anyway),
4) now comes the totally, unbelievably cunningly scientific part: I did a
self-experiment! I deliberately misspelt commands in a totally random
manner!
5) then I changed the code to actually output the distances so I could
determine a cut-off that makes sense with my type of tyops,
6) after about 15 tries of deliberate mistakes (mostly doing what I
usually do, something like "git pull" and "git log" or something like
that, but watching TV, chatting on the phone _and_ cleaning the dishes
at the same time), I found that 5 was too low and 7 too large.
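
Steps 2, 3 and 5 above can be sketched roughly as follows. This is only an
illustration, not the code from help.c: the struct `candidate`, the
functions `read_cutoff` and `list_similar`, and the environment variable
name `TYPO_CUTOFF` are all made up here, and treating the cut-off as
exclusive (a distance must be strictly below it) is an assumption.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical pairing of a command name with its distance to the typo. */
struct candidate {
	const char *name;
	int distance;
};

/* qsort comparator: smallest distance first. */
int by_distance(const void *a, const void *b)
{
	const struct candidate *x = a, *y = b;
	return (x->distance > y->distance) - (x->distance < y->distance);
}

/*
 * Read the cut-off from the (hypothetical) TYPO_CUTOFF environment
 * variable; fall back to the given default if it is unset or not a number.
 */
int read_cutoff(int fallback)
{
	const char *value = getenv("TYPO_CUTOFF");
	char *end;
	long parsed;

	if (!value || !*value)
		return fallback;
	parsed = strtol(value, &end, 10);
	return *end ? fallback : (int)parsed;
}

/*
 * Sort the candidates by distance and print every one whose distance is
 * strictly below the cut-off, together with that distance (step 5).
 * Returns the number of candidates listed.
 */
size_t list_similar(struct candidate *c, size_t n, int cutoff)
{
	size_t i;

	qsort(c, n, sizeof(*c), by_distance);
	for (i = 0; i < n && c[i].distance < cutoff; i++)
		printf("%s (distance %d)\n", c[i].name, c[i].distance);
	return i;
}
```

Printing the distance next to each name is what makes it possible to
eyeball a sensible cut-off after a handful of deliberate typos.
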
The number '6' happily coincided with the number of steps I needed to come
up with the number. You see? The _perfect_ way to determine a completely
arbitrary number.
Actually, you probably see that I just made up that number and tested a
few times, and it seemed to work reasonably well.
FWIW almost the same procedure led to the weights 0, 2, 1 and 4 that you
see in help.c. The weights are basically factors with which mistakes are
punished: if you just confuse two adjacent letters, such as "psuh" instead
of "push" (which can be quite common if you use two hands, one on the left
side and one on the right side of the keyboard, with the en-US layout that
so many of us use, myself included), it costs 0.
If you write a different character than what you intended, the cost is 2.
The idea behind it is that you're more likely to miss a key than to hit
the wrong key. With the laptop I am typing this email on, it is
particularly likely that I miss a key, because there are certain
key combinations where only the first key triggers an input event, but the
second only triggers an input event when it is _released_ after the first
one. So when I type "er" real fast and happen to release the "e" key
after the "r" key, no "r" appears on my screen.
Okay, so the weight for adding a character must be smaller than that for
substituting a character, but why is the cost for deletion so high?
Well, I really rarely type unnecessary characters (except when writing to
the Git mailing list, arguably) so those costs must be substantially
higher than for typing the wrong character.
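
To make the four weights concrete, here is a minimal sketch of a weighted
Damerau-Levenshtein distance (the restricted variant that only considers
swaps of adjacent letters). The function name `weighted_distance` and its
parameter order are assumptions for illustration; the actual code in the
Git sources may differ.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Cost of turning `typed` into `wanted`, where
 *   w = swapping two adjacent letters,
 *   s = substituting one letter for another,
 *   a = adding a letter that was missed while typing,
 *   d = deleting a letter that was typed unnecessarily.
 * Returns -1 if memory allocation fails.
 */
int weighted_distance(const char *typed, const char *wanted,
		      int w, int s, int a, int d)
{
	size_t len1 = strlen(typed), len2 = strlen(wanted), i, j;
	int *row0 = malloc(sizeof(int) * (len2 + 1));
	int *row1 = malloc(sizeof(int) * (len2 + 1));
	int *row2 = malloc(sizeof(int) * (len2 + 1));
	int result = -1;

	if (!row0 || !row1 || !row2)
		goto out;

	/* Empty `typed` prefix: every letter of `wanted` must be added. */
	for (j = 0; j <= len2; j++)
		row1[j] = (int)j * a;

	for (i = 0; i < len1; i++) {
		int *tmp;

		/* Empty `wanted` prefix: every typed letter must be deleted. */
		row2[0] = (int)(i + 1) * d;
		for (j = 0; j < len2; j++) {
			/* substitution (free if the letters already match) */
			row2[j + 1] = row1[j] + s * (typed[i] != wanted[j]);
			/* swap of two adjacent letters */
			if (i > 0 && j > 0 &&
			    typed[i - 1] == wanted[j] &&
			    typed[i] == wanted[j - 1] &&
			    row2[j + 1] > row0[j - 1] + w)
				row2[j + 1] = row0[j - 1] + w;
			/* deletion of a superfluous typed letter */
			if (row2[j + 1] > row1[j + 1] + d)
				row2[j + 1] = row1[j + 1] + d;
			/* addition of a missed letter */
			if (row2[j + 1] > row2[j] + a)
				row2[j + 1] = row2[j] + a;
		}
		/* rotate the three rows for the next typed letter */
		tmp = row0;
		row0 = row1;
		row1 = row2;
		row2 = tmp;
	}
	result = row1[len2];
out:
	free(row0);
	free(row1);
	free(row2);
	return result;
}
```

With the weights 0, 2, 1 and 4, "psuh" is at distance 0 from "push", "pul"
at distance 1 from "pull", "pusj" at distance 2 from "push", and "pulll" at
distance 4 from "pull".
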
My original plan was to log all my tyops into a log file and analyze those
errors later, but then my initial 0, 2, 1, 4 and 6 constants worked well
enough for me that I did not bother.
Satisfied?
Ciao,
Dscho