Introduction
The Levenshtein distance is an algorithm that returns the "difference" between two strings. That difference is actually the least amount of add a letter/delete a letter/replace a letter operations you would have to apply to the first string, to make it equal with the second. In web applications this can be useful for searches where the user types a word incorrectly, and you will want to look for close matches, instead of exact matches.Advantages of the improved version
-
S1
: the first string; -
S2
: the second string; -
N
: the length ofS1
; -
M
: the length ofS2
;
N*M
elements. My improved version uses only two arrays (there is an article describing this improvement here), making it more efficient.Another improvement is that this version also has a
limit
parameter.e.g. While searching for close matches, you certainly wouldn't want "bicycle" to match "hurricane". The difference between the two is too big.
The classic algorithm will compute the distance, even if the two words are very different. This takes A LOT more time than if you would have known that you want to search for differences smaller than
limit
.Description of the improved algorithm
- First, we see if
S1
andS2
are identical (S1==S2
). If they are, it would be a waste of time to actually calculate their difference. - After that, we calculate the absolute value of the difference between
N
andM
. If it is greater than the limit, we could be sure that the difference betweenS1
andS2
is greater that the limit, because we will have to insert at leastabs(
letters.N-M
) - Next, the classic algorithm would examine each letter of
S1
, and each letter ofS2
. But that would be a waste of time if we know that the difference should be less thanlimit
. Instead, we will examine for each letteri
ofS1
, only the letters betweeni-limit
andi+limit
ofS2
. It would be useless to examine the letters beforei-limit
, or afteri+limit
, because the difference between the firsti
letters ofS1
, and the firsti+limit+1
,i+limit+2
... andi+limit-1
,i+limit-2
... letters ofS2
would certainly be bigger thanlimit
. *e.g. You would have to makelimit+1
insertions to transform the firsti
letters ofS1
intro the firsti+limit+1
letters ofS2
.* It is also necessary to initialize the array with a huge value in thei+limit-1
andi+limit+1
positions (if these positions exist) to prevent the algorithm from choosing those values in the next step (because that part of the array is "untouched", the values would be 0). - For each
i
letter ofS1
, the algorithm sees if there was at least a computed value that is<=limit
, otherwise it returnsinfinite
(in my algorithm, 9.999.999)
Using the Code
I have implemented the algorithm in two popular languages for web development: PHP and JavaScript .The PHP version
The PHP function iscompare($string1,$string2,$limit)
:- $string1 is the first string to be compared.
- $string2 is the second string to be compared.
- $limit is optional, but highly recommended: it is the limit of the calculated distance.
Example:
echo compare("efficient","sufficient",4);
//the output will be 2
echo compare("malicious","delicious",10);
//the output will be 2
echo compare("grandma","anathema",5);
//the output will be 5
echo compare("grandma","anathema",4);
//the output will be 9999999
The JavaScript version
The JavaScript function iscompare(string1,string2,limit)
:- string1 is the first string to be compared.
- string2 is the second string to be compared.
- limit is optional, but highly recommended: it is the limit of the calculated distance.
Example:
alert(compare("efficient","sufficient",4));
//the output will be 2
alert(compare("malicious","delicious",10));
//the output will be 2
alert(compare("grandma","anathema",5));
//the output will be 5
alert(compare("grandma","anathema",4));
//the output will be 9999999
References
History
- 8 April 2012- published article with source files