-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NaN (or ArgumentError) from QGram distances for short strings #61
Comments
Note that it goes even deeper since all but QGram returns NaN if both inputs are shorter than q: julia> filter(d -> isnan(d(2)("", "")), [QGram, Cosine, Jaccard, Overlap, SorensenDice, MorisitaOverlap, NMD])
6-element Vector{DataType}:
Cosine
Jaccard
Overlap
SorensenDice
MorisitaOverlap
NMD
julia> QGram(1)("", "")
0
julia> QGram(2)("a", "b")
0 |
I am not sure there is an issue with the current implementation. The way I think about it is that there is a formula for each Qgram distance (given in the docs), which is valid even when the set of qgrams is empty. In some distances, the length of qgrams appears in the denominator, which is why the distance returns NaN when the set of qgrams is empty. |
Mathematically I agree. I see dangers in actual use but ok, people will have to handle it themselves. I guess a simple solution might be to just highlight somewhere in the documentation that one can add a julia> function safeevaluate(D::Union{Cosine, Overlap, MorisitaOverlap}, s1, s2)
length(s1) >= D.q && length(s2) >= D.q && return(evaluate(D, s1, s2))
throw(ArgumentError("An argument is shorter than q ($(D.q)): \"$s1\", \"$s2\""))
end
safeevaluate (generic function with 1 method)
julia> D = Cosine(2)
Cosine(2)
julia> @assert safeevaluate(D, "aa", "bb") == evaluate(D, "aa", "bb")
julia> safeevaluate(D, "", "bb")
ERROR: ArgumentError: An argument is shorter than q (2): "", "bb"
Stacktrace:
[1] safeevaluate(D::Cosine, s1::String, s2::String)
@ Main ./REPL[2]:3
[2] top-level scope
@ REPL[5]:1
julia> function safeevaluate(D::Union{Jaccard, SorensenDice, NMD}, s1, s2)
(length(s1) >= D.q || length(s2) >= D.q) && return(evaluate(D, s1, s2))
throw(ArgumentError("An argument is shorter than q ($(D.q)): \"$s1\", \"$s2\""))
end
safeevaluate (generic function with 2 methods)
julia> safeevaluate(Jaccard(2), "", "")
ERROR: ArgumentError: An argument is shorter than q (2): "", ""
Stacktrace:
[1] safeevaluate(D::Jaccard, s1::String, s2::String)
@ Main ./REPL[11]:3
[2] top-level scope
@ REPL[12]:1 |
Sure. I'm also open to change things — following what other libraries typically do in this case. I will leave this issue open. |
Some QGram distances today return NaN if one of the input strings are shorter than the q-gram length (q) while others don't:
Maybe it is better to have a consistent behaviour for such inputs? Returning an ArgumentError might be better and then the caller has to decide how to handle such situations.
The text was updated successfully, but these errors were encountered: