It can be considered a generalization of the Hamming distance, which is used for strings of the same length and only considers substitution edits.
Table of contents |
2 Upper and lower bounds 3 External links |
A commonly-used algorithm for computing the Levenshtein distance involves the use of an (n + 1) × (m + 1) matrix, where n and m are the lengths of the two strings. Here is pseudocode for it:
for i := 0 to n
d[i,0] := i
for j := 0 to m
d[0,j] := j
for i := 1 to n
for j := 1 to m
if s[i] = t[j] then cost := 0
else cost := 1
d[i,j] := minimum(d[i-1,j ] + 1, // insertion
d[i, j-1] + 1, // deletion
d[i-1,j-1] + cost) // substitution
return d[n,m]
Possible improvements to this algorithm include:
The invariant is that we can transform the initial segment
The Levenshtein distance has several simple upper and lower bounds that are useful in applications which compute many of them and compare them. These include:
The Algorithm
int LevenshteinDistance(char s[1..n], char t[1..m])
declare int d[0..n,0..m]
declare int i, j, cost
Possible Improvements
j
.
d[i,0]
can be moved inside the main outer loop.
cost
s can be computed in parallel, and the algorithm can be adapted to perform the minimum
function in phases to eliminate dependencies.Proof of Correctness
s[1..i]
into t[1..j]
using a minimum of d[i,j]
operations. This invariant holds since:
This proof fails to validate that the number placed in s[1..i]
can be transformed into the empty string t[1..0]
by simply dropping all i
characters. Similarly, we can transform s[1..0]
to t[1..j]
by simply adding all j
characters.
s[1..i]
to t[1..j-1]
in k
operations, then we can simply add t[j]
afterwards to get t[1..j]
in k+1
operations.
s[1..i-1]
to t[1..j]
in k
operations, then we can do the same operations on s[1..i]
and then remove the original s[i]
at the end in k+1
operations.
s[1..i-1]
to t[1..j-1]
in k
operations, we can do the same to s[1..i]
and then do a substitution of t[j]
for the original s[i]
at the end if necessary, requiring k+cost
operations.
s[1..n]
into t[1..m]
is of course the number required to transform all of s
into all of t
, and so d[n,m]
holds our result.d[i,j]
is in fact minimal; this is more difficult to show, and involves an argument by contradiction in which we assume d[i,j]
is smaller than the minimum of the three, and use this to show one of the three is not minimal.Upper and lower bounds
s
and t
, the number of characters found in s
but not in t
is a lower bound.External links