Any comparison with Lucene's Levenshtein automaton? #72
Comments
Something like http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata or https://issues.apache.org/jira/browse/LUCENE-2507 ?

From what I understand from Michael McCandless's post: prior to 4.0, FuzzyQuery took a brute-force approach. It visited every single unique term in the dictionary, computed the Levenshtein distance for it, and accepted the term if the edit distance was <= the maximum edit distance. From 4.0 on, they pre-calculate a Levenshtein automaton that accepts only the terms within edit distance N, and at query time they intersect that automaton with all the terms in the dictionary.

This would mean they save time, because they don't need to calculate the Levenshtein distance for every single term. But it seems they still need to intersect the pre-built Levenshtein automaton with every single term in the dictionary, whereas SymSpell has to calculate the Levenshtein distance only for 0.0042% to 0.016% of the vocabulary. It would certainly be interesting to extend the benchmark to see how the two algorithms compare in reality.
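For reference, the pre-4.0 brute-force approach described above can be sketched in a few lines. This is an illustrative Python sketch, not Lucene's actual code; the function names are made up:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one matrix row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def brute_force_fuzzy(query, dictionary, max_edits=2):
    """Visit every unique term: O(|dictionary|) distance computations."""
    return [t for t in dictionary if levenshtein(query, t) <= max_edits]
```

The cost that SymSpell and the automaton approach both try to avoid is exactly the loop in `brute_force_fuzzy`: one full distance computation per dictionary term.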
This is not the case if the dictionary is also indexed in a trie structure. In that case, intersecting the automaton with the dictionary is a simple recursive algorithm that only follows branches of the trie with corresponding transitions in the state machine.
It seems like somebody did indeed benchmark SymSpell vs. Lucene's spelling correction:
http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html