Similarity returns a number that is 1.0f or less (including negative numbers)
based on how similar the Term is compared to a target term. It returns
exactly 0.0f when
editDistance < maximumEditDistance
Otherwise it returns:
1 - (editDistance / length)
where length is the length of the shortest term (text or target) including a
prefix that are identical and editDistance is the Levenshtein distance for
the two words.
Embedded within this algorithm is a fail-fast Levenshtein distance
algorithm. The fail-fast algorithm differs from the standard Levenshtein
distance algorithm in that it is aborted if it is discovered that the
mimimum distance between the words is greater than some threshold.
To calculate the maximum distance threshold we use the following formula:
(1 - minimumSimilarity) * length
where length is the shortest term including any prefix that is not part of the
similarity comparision. This formula was derived by solving for what maximum value
of distance returns false for the following statements:
similarity = 1 - ((float)distance / (float) (prefixLength + Math.min(textlen, targetlen)));
return (similarity > minimumSimilarity);
where distance is the Levenshtein distance for the two words.
Levenshtein distance (also known as edit distance) is a measure of similiarity
between two strings where the distance is measured as the number of character
deletions, insertions or substitutions required to transform one string to
the other string.
final int m = target.length();
final int n = text.length();
if (n == 0) {
//we don't have anything to compare. That means if we just add
//the letters for m we get the new word
return prefix.length() == 0 ? 0.0f : 1.0f - ((float) m / prefix.length());
}
if (m == 0) {
return prefix.length() == 0 ? 0.0f : 1.0f - ((float) n / prefix.length());
}
final int maxDistance = getMaxDistance(m);
if (maxDistance < Math.abs(m-n)) {
//just adding the characters of m to n or vice-versa results in
//too many edits
//for example "pre" length is 3 and "prefixes" length is 8. We can see that
//given this optimal circumstance, the edit distance cannot be less than 5.
//which is 8-3 or more precisesly Math.abs(3-8).
//if our maximum edit distance is 4, then we can discard this word
//without looking at it.
return 0.0f;
}
//let's make sure we have enough room in our array to do the distance calculations.
if (d[0].length <= m) {
growDistanceArray(m);
}
// init matrix d
for (int i = 0; i <= n; i++) d[i][0] = i;
for (int j = 0; j <= m; j++) d[0][j] = j;
// start computing edit distance
for (int i = 1; i <= n; i++) {
int bestPossibleEditDistance = m;
final char s_i = text.charAt(i - 1);
for (int j = 1; j <= m; j++) {
if (s_i != target.charAt(j-1)) {
d[i][j] = min(d[i-1][j], d[i][j-1], d[i-1][j-1])+1;
}
else {
d[i][j] = min(d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1]);
}
bestPossibleEditDistance = Math.min(bestPossibleEditDistance, d[i][j]);
}
//After calculating row i, the best possible edit distance
//can be found by found by finding the smallest value in a given column.
//If the bestPossibleEditDistance is greater than the max distance, abort.
if (i > maxDistance && bestPossibleEditDistance > maxDistance) { //equal is okay, but not greater
//the closest the target can be to the text is just too far away.
//this target is leaving the party early.
return 0.0f;
}
}
// this will return less than 0.0 when the edit distance is
// greater than the number of characters in the shorter word.
// but this was the formula that was previously used in FuzzyTermEnum,
// so it has not been changed (even though minimumSimilarity must be
// greater than 0.0)
return 1.0f - ((float)d[n][m] / (float) (prefix.length() + Math.min(n, m)));