
reproducing benchmark results #32

Open
maxbachmann opened this issue Jan 6, 2022 · 3 comments


@maxbachmann
Contributor

maxbachmann commented Jan 6, 2022

I tried to reproduce the benchmarks in your readme. However, my results from running the same benchmark differ greatly from the results you achieved. Note that I scaled the benchmarks down from 500000 to 5000 iterations, since this is enough to get a good idea of the performance difference without spending all day running the benchmark.
On Python 3.9 I get:

>>> import timeit
>>> timeit.timeit("damerau_levenshtein_distance('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from pyxdameraulevenshtein import damerau_levenshtein_distance', number=5000)
0.30914585100072145
>>> timeit.timeit("dameraulevenshtein('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from dameraulevenshtein import dameraulevenshtein', number=5000)
2.0448212559995227
>>> timeit.timeit("difflib.SequenceMatcher(None, 'e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7').ratio()", 'import difflib', number=5000)
0.29983263299982355

and on Python 2.7:

>>> import timeit
>>> timeit.timeit("damerau_levenshtein_distance('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from pyxdameraulevenshtein import damerau_levenshtein_distance', number=5000)
0.4308760166168213
>>> timeit.timeit("dameraulevenshtein('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from dameraulevenshtein import dameraulevenshtein', number=5000)
1.8721919059753418
>>> timeit.timeit("difflib.SequenceMatcher(None, 'e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7').ratio()", 'import difflib', number=5000)
0.3515639305114746

So the performance differences in your benchmarks appear to differ by about an order of magnitude from mine, both for pyxdameraulevenshtein vs. Michael Homer's implementation and for pyxdameraulevenshtein vs. difflib. My best guess is that the Python version the benchmarks were created on (they have existed since the library still supported Python 2.4+) was significantly slower than more recent versions of Python. It would probably make sense to redo these benchmarks on a more recent version of Python.
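For anyone redoing the benchmarks, a minimal stdlib-only harness might look like this. It only covers the difflib baseline from the README (pyxdameraulevenshtein would need to be installed separately); the iteration counts and `timeit.repeat` usage are my own choices, not from the original benchmark:

```python
import difflib
import timeit

# Same test strings as in the README benchmark
a = 'e0zdvfb840174ut74j2v7gabx1 5bs'
b = 'qpk5vei 4tzo0bglx8rl7e 2h4uei7'

# Take the minimum over several repeats to reduce timing noise
t = min(timeit.repeat(
    lambda: difflib.SequenceMatcher(None, a, b).ratio(),
    number=5000, repeat=5))
print(f"difflib.SequenceMatcher: {t:.4f}s for 5000 calls")
```

Taking the minimum of several repeats is the usual way to get a stable lower bound, since system load only ever makes runs slower.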

@gfairchild
Collaborator

I'm seeing performance similar to yours:

Python 3.7.6 (default, Jan  8 2020, 13:42:34)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.timeit("damerau_levenshtein_distance('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from pyxdameraulevenshtein import damerau_levenshtein_distance', number=500000)
28.597779377000002
>>> timeit.timeit("difflib.SequenceMatcher(None, 'e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7').ratio()", 'import difflib', number=500000)
31.961154162

It could be that the difflib implementation was improved to be much faster, or something changed in the CPython interpreter, or I inadvertently introduced a performance regression somewhere. I'm not sure.

@maxbachmann
Contributor Author

maxbachmann commented Sep 16, 2022

I finally got around to implementing the optimal string alignment distance in RapidFuzz and comparing the performance.

[benchmark plot: OSA distance runtime, RapidFuzz vs. pyxDamerauLevenshtein]

I expected it to be faster due to the better time complexity (O([N/64]*M) for the bit-parallel implementation vs. O(N*M)). However, the large time difference for short sequences came as a surprise. It appears pyxDamerauLevenshtein has a pretty large constant factor.
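For reference, the optimal string alignment (OSA) distance being compared here can be sketched as the classic O(N*M) dynamic program. This is a simplified illustration of the metric, not RapidFuzz's bit-parallel implementation:

```python
def osa_distance(s1, s2):
    """Optimal string alignment: edit distance with insertions, deletions,
    substitutions, and adjacent transpositions, where no substring is
    edited more than once (classic O(len(s1)*len(s2)) dynamic program)."""
    d = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        d[i][0] = i
    for j in range(len(s2) + 1):
        d[0][j] = j
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            # Adjacent transposition (the OSA extension over Levenshtein)
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("abc", "acb"))  # → 1 (one adjacent transposition)
print(osa_distance("ca", "abc"))   # → 3 (OSA, vs. 2 for unrestricted DL)
```

The last example shows how OSA differs from the unrestricted Damerau-Levenshtein distance: because each substring may only be edited once, "ca" → "abc" costs 3 under OSA but 2 under true DL.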

@gfairchild
Collaborator

I honestly don't really have the time to maintain this library because I'm no longer using it in my research, so I'm admittedly not quite sure why performance has dropped off like this. At the time I wrote it, it was by far the most performant Damerau-Levenshtein distance library for Python (and the only one that supported Unicode), but this is clearly not the case anymore.
