Implement Optimal String Alignment (OSA) Distance Algorithm #464
Summary
This pull request introduces an implementation of the Optimal String Alignment (OSA) Distance algorithm. This string metric is used to measure the difference between two sequences (typically strings) by calculating the minimum number of operations required to transform one string into another. The operations considered include insertion, deletion, substitution, and the transposition of two adjacent characters.
What is the Optimal String Alignment (OSA) Distance?
The Optimal String Alignment (OSA) distance, also referred to as the restricted Damerau-Levenshtein distance, is a variation of the classic Levenshtein distance with an additional operation—transposition of adjacent characters. This metric is particularly useful in scenarios where such transpositions are common, such as typographical errors or spelling mistakes.
Key Operations: insertion, deletion, substitution, and transposition of two adjacent characters.
Difference from Damerau-Levenshtein Distance:
While the OSA distance allows transpositions like the unrestricted Damerau-Levenshtein distance, it differs in that no substring may be edited more than once: after two adjacent characters have been transposed, neither of them can take part in a further operation. As a consequence, the OSA distance can be larger than the Damerau-Levenshtein distance for the same pair of strings, and it does not always satisfy the triangle inequality. The restriction keeps the algorithm simple, which suits applications where edits are expected to be simple, like correcting minor spelling errors.
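The difference shows up on very small inputs. The sketch below is illustrative only and is not taken from this pull request; the function names `osa_distance` and `damerau_levenshtein_distance` are hypothetical, and Python is used purely for illustration. It compares the restricted recurrence with the standard dynamic-programming formulation of the unrestricted distance on `"CA"` and `"ABC"`: the unrestricted version may transpose `"CA"` into `"AC"` and then insert `"B"` between the swapped characters (2 operations), while OSA may not edit a transposed pair again and needs 3.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Restricted (OSA) distance: a transposed pair is never edited again."""
    n, m = len(s1), len(s2)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Transposition only looks at the two characters directly before (i, j).
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]


def damerau_levenshtein_distance(a: str, b: str) -> int:
    """Unrestricted distance: edits between transposed characters are allowed."""
    max_dist = len(a) + len(b)
    last_row = {}  # last (1-based) row in which each character of `a` was seen
    # The table has an extra sentinel row and column filled with max_dist.
    d = [[max_dist] * (len(b) + 2) for _ in range(len(a) + 2)]
    for i in range(len(a) + 1):
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[1][j + 1] = j
    for i in range(1, len(a) + 1):
        last_col = 0  # last column in this row where the characters matched
        for j in range(1, len(b) + 1):
            k = last_row.get(b[j - 1], 0)
            l = last_col
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if cost == 0:
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                          # substitution or match
                d[i + 1][j] + 1,                         # insertion
                d[i][j + 1] + 1,                         # deletion
                d[k][l] + (i - k - 1) + 1 + (j - l - 1)  # transposition plus edits in between
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


print(osa_distance("CA", "ABC"))                  # 3
print(damerau_levenshtein_distance("CA", "ABC"))  # 2
```

For inputs that differ only by a plain adjacent swap, such as `"ab"` and `"ba"`, both functions return 1; they diverge only when further edits would have to touch the swapped pair.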
How It Works
The algorithm uses dynamic programming to compute the distance. The main idea is to build a matrix where each cell `(i, j)` represents the OSA distance between the first `i` characters of string `s1` and the first `j` characters of string `s2`.
The algorithm proceeds as follows:
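Initialization:
The first row and first column of the matrix are filled with the prefix lengths (0 up to the length of `s2`, and 0 up to the length of `s1`), since turning an empty prefix into a prefix of length k takes k insertions or deletions.
Filling the Matrix:
For each pair of positions in `s1` and `s2`, calculate the cost of insertion, deletion, substitution, and transposition, and store the minimum in the cell.
Final Output:
The value in the bottom-right cell of the matrix is the OSA distance between `s1` and `s2`.
Example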
Consider two strings: `"example"` and `"exmaple"`.
The OSA distance between these two strings is 1 because you can transform `"exmaple"` into `"example"` by a single transposition of the characters 'm' and 'a'.
Another example:
Here, the distance is 3 due to the following operations:
Motivation
Why Use OSA Distance?
The OSA distance is particularly advantageous in applications where adjacent character transpositions are common, such as detecting and correcting typographical errors or spelling mistakes.
Time Complexity
The time complexity of the OSA distance algorithm is `O(n * m)`, where `n` is the length of the first string and `m` is the length of the second string. This makes it efficient for moderate-length strings, but it may become computationally expensive for very long strings. However, this complexity is comparable to that of similar algorithms, such as Levenshtein and Damerau-Levenshtein, making OSA a practical choice for many real-world applications.
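As a rough illustration of where the `O(n * m)` cost comes from, the procedure described under How It Works can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code added by this pull request: the function name `osa_distance` and the choice of Python are made here for illustration only.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Sketch of the OSA (restricted Damerau-Levenshtein) distance."""
    n, m = len(s1), len(s2)

    # Initialization: d[i][j] will hold the distance between the first i
    # characters of s1 and the first j characters of s2.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i characters
    for j in range(m + 1):
        d[0][j] = j  # insert all j characters

    # Filling the matrix: two nested loops over n and m give O(n * m) time.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if (
                i > 1
                and j > 1
                and s1[i - 1] == s2[j - 2]
                and s1[i - 2] == s2[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)

    # Final output: the bottom-right cell.
    return d[n][m]


print(osa_distance("exmaple", "example"))  # 1
```

Each cell is computed in constant time from at most four earlier cells, so the total work is proportional to the number of cells, `(n + 1) * (m + 1)`.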