
reproducing benchmark results #32

Open
maxbachmann opened this issue Jan 6, 2022 · 3 comments


@maxbachmann
Contributor

maxbachmann commented Jan 6, 2022

I tried to reproduce the benchmarks in your readme. However, my results from running the same benchmark differ greatly from the results you achieved. Note that I scaled the benchmarks down from 500000 to 5000 iterations, since this is enough to get a good idea of the performance difference without spending all day running the benchmark.
On Python 3.9 I get:

>>> import timeit
>>> timeit.timeit("damerau_levenshtein_distance('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from pyxdameraulevenshtein import damerau_levenshtein_distance', number=5000)
0.30914585100072145
>>> timeit.timeit("dameraulevenshtein('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from dameraulevenshtein import dameraulevenshtein', number=5000)
2.0448212559995227
>>> timeit.timeit("difflib.SequenceMatcher(None, 'e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7').ratio()", 'import difflib', number=5000)
0.29983263299982355

and on Python 2.7:

>>> import timeit
>>> timeit.timeit("damerau_levenshtein_distance('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from pyxdameraulevenshtein import damerau_levenshtein_distance', number=5000)
0.4308760166168213
>>> timeit.timeit("dameraulevenshtein('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from dameraulevenshtein import dameraulevenshtein', number=5000)
1.8721919059753418
>>> timeit.timeit("difflib.SequenceMatcher(None, 'e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7').ratio()", 'import difflib', number=5000)
0.3515639305114746

So the performance differences in your benchmarks appear to differ by about an order of magnitude from mine, both for pyxdameraulevenshtein vs. Michael Homer's implementation and for pyxdameraulevenshtein vs. difflib. My best guess is that the Python version the benchmarks were created on (they have existed since the library still supported Python 2.4+) was significantly slower than more recent versions of Python. It would probably make sense to redo these benchmarks on a more recent version of Python.
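For anyone redoing the benchmarks, a minimal stdlib-only harness might look like this. It only covers the difflib baseline from the README (pyxdameraulevenshtein would need to be installed separately); the iteration counts and `timeit.repeat` usage are my own choices, not from the original benchmark:

```python
import difflib
import timeit

# Same test strings as in the README benchmark
a = 'e0zdvfb840174ut74j2v7gabx1 5bs'
b = 'qpk5vei 4tzo0bglx8rl7e 2h4uei7'

# Take the minimum over several repeats to reduce timing noise
t = min(timeit.repeat(
    lambda: difflib.SequenceMatcher(None, a, b).ratio(),
    number=5000, repeat=5))
print(f"difflib.SequenceMatcher: {t:.4f}s for 5000 calls")
```

Taking the minimum of several repeats is the usual way to get a stable lower bound, since system load only ever makes runs slower.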

@gfairchild
Collaborator

I'm seeing performance similar to yours:

Python 3.7.6 (default, Jan  8 2020, 13:42:34)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.timeit("damerau_levenshtein_distance('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from pyxdameraulevenshtein import damerau_levenshtein_distance', number=500000)
28.597779377000002
>>> timeit.timeit("difflib.SequenceMatcher(None, 'e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7').ratio()", 'import difflib', number=500000)
31.961154162

It could be that the difflib implementation was improved to be much faster, or something changed in the CPython interpreter, or I inadvertently introduced a performance regression somewhere. I'm not sure.

@maxbachmann
Contributor Author

maxbachmann commented Sep 16, 2022

I finally got around to implementing the optimal string alignment distance in RapidFuzz and comparing the performance.

[benchmark plot: OSA distance runtime, RapidFuzz vs. pyxDamerauLevenshtein]

I expected it to be faster due to the better time complexity (O([N/64]*M) for the bit-parallel implementation vs. O(N*M)). However, the large time difference for short sequences came as a surprise. It appears pyxDamerauLevenshtein has a pretty large constant factor.
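For reference, the optimal string alignment (OSA) distance being compared here can be sketched as the classic O(N*M) dynamic program. This is a simplified illustration of the metric, not RapidFuzz's bit-parallel implementation:

```python
def osa_distance(s1, s2):
    """Optimal string alignment: edit distance with insertions, deletions,
    substitutions, and adjacent transpositions, where no substring is
    edited more than once (classic O(len(s1)*len(s2)) dynamic program)."""
    d = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        d[i][0] = i
    for j in range(len(s2) + 1):
        d[0][j] = j
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            # Adjacent transposition (the OSA extension over Levenshtein)
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("abc", "acb"))  # → 1 (one adjacent transposition)
print(osa_distance("ca", "abc"))   # → 3 (OSA, vs. 2 for unrestricted DL)
```

The last example shows how OSA differs from the unrestricted Damerau-Levenshtein distance: because each substring may only be edited once, "ca" → "abc" costs 3 under OSA but 2 under true DL.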

@gfairchild
Collaborator

I honestly don't really have the time to maintain this library because I'm no longer using it in my research, so I'm admittedly not quite sure why performance has dropped off like this. At the time I wrote it, it was by far the most performant Damerau-Levenshtein distance library for Python (and the only one that supported Unicode), but this is clearly not the case anymore.
