-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mutating version of evaluate #20
Comments
As an example words1 = [join(rand('a':'z', rand(4:10))) for _ in 1:100]
words2 = [join(rand('a':'z', rand(4:10))) for _ in 1:100]
function f1(w1, w2)
dl = DamerauLevenshtein()
for word1 in w1, word2 in w2
evaluate(dl, word1, word2)
end
end
function f2(w1, w2)
dl = DamerauLevenshtein(10)
for word1 in w1, word2 in w2
evaluate!(dl, word1, word2)
end
end
@benchmark f1($words1, $words2)
BenchmarkTools.Trial:
memory estimate: 2.79 MiB
allocs estimate: 20000
--------------
minimum time: 2.785 ms (0.00% GC)
median time: 3.038 ms (0.00% GC)
mean time: 3.131 ms (2.09% GC)
maximum time: 5.036 ms (24.49% GC)
--------------
samples: 1596
evals/sample: 1
@benchmark f2($words1, $words2)
BenchmarkTools.Trial:
memory estimate: 352 bytes
allocs estimate: 3
--------------
minimum time: 2.006 ms (0.00% GC)
median time: 2.170 ms (0.00% GC)
mean time: 2.188 ms (0.00% GC)
maximum time: 3.021 ms (0.00% GC)
--------------
samples: 2283
evals/sample: 1 |
Interesting. If I understand correctly, |
The other possibility would be to create a vector of length 30 by default, say, and then |
So it looks like it should not be difficult to add this. The cost is that (i) it makes the code a bit more complicated (ii) it requires to create the distance outside the loop (iii) it may complicate stuff like multithreading. I'm not sure it outweighs the benefits. Are there cases where it's really important to avoid these allocations? |
10 was only an example, basically a It will add small overhead, which I think is negligible compared to the cost of mutable struct DL
v0::Vector{Int}
v2::Vector{Int}
DL(sizehint) = new(Vector{Int}(undef, sizehint), Vector{Int}(undef, sizehint))
DL() = new()
end
@btime DL()
4.560 ns (1 allocation: 32 bytes)
# as compared to current implementation
struct DL2 end
@btime DL2()
0.017 ns (0 allocations: 0 bytes) I've stumbled upon this, when I was implementing SymSpell fuzzy search, at some point I had to compare original phrase with lots of suggestions using DL algorithm and mutating version increased speed significantly. And overall changes are minimal, it's basically these three lines: https://github.com/Arkoniak/SymSpellChecker.jl/blob/master/src/distance.jl#L34-L36 |
When I was working with
DamerauLevenshtein
functionevaluate
I've noticed that it allocates a lot. Problem is with these lines: https://github.com/matthieugomez/StringDistances.jl/blob/master/src/edit.jl#L146-L147It doesn't look much when you compare only two words, but in my particular case, it was needed to compare thousands of words in a single run and these allocations became a real problem. I rewrote this algorithm by adding two mutable vectors that were reused between calculation and saw almost 80% drop in allocations and 10%-20% increase in speed. If it is interesting I can make a PR, which will add mutating version of this and
Levenshtein
evaluate
function.The text was updated successfully, but these errors were encountered: