Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaN (or ArgumentError) from QGram distances for short strings #61

Open
robertfeldt opened this issue Dec 20, 2022 · 4 comments
Open

NaN (or ArgumentError) from QGram distances for short strings #61

robertfeldt opened this issue Dec 20, 2022 · 4 comments

Comments

@robertfeldt
Copy link
Collaborator

Some QGram distances today return NaN if one of the input strings are shorter than the q-gram length (q) while others don't:

julia> using StringDistances

julia> isnan(Cosine(2)("", "bb"))
true

julia> isnan(Cosine(2)("a", "bb"))
true

julia> Jaccard(2)("a", "bb")
1.0

julia> filter(d -> isnan(d(2)("", "bb")), [QGram, Cosine, Jaccard, Overlap, SorensenDice, MorisitaOverlap, NMD])
3-element Vector{DataType}:
 Cosine
 Overlap
 MorisitaOverlap

Maybe it is better to have a consistent behaviour for such inputs? Returning an ArgumentError might be better and then the caller has to decide how to handle such situations.

@robertfeldt
Copy link
Collaborator Author

Note that it goes even deeper since all but QGram returns NaN if both inputs are shorter than q:

julia> filter(d -> isnan(d(2)("", "")), [QGram, Cosine, Jaccard, Overlap, SorensenDice, MorisitaOverlap, NMD])
6-element Vector{DataType}:
 Cosine
 Jaccard
 Overlap
 SorensenDice
 MorisitaOverlap
 NMD

julia> QGram(1)("", "")
0

julia> QGram(2)("a", "b")
0

@matthieugomez
Copy link
Owner

matthieugomez commented Dec 22, 2022

I am not sure there is an issue with the current implementation. The way I think about it is that there is a formula for each Qgram distance (given in the docs), which is valid even when the set of qgrams is empty. In some distances, the length of qgrams appears in the denominator, which is why the distance returns NaN when the set of qgrams is empty.

@robertfeldt
Copy link
Collaborator Author

Mathematically I agree.

I see dangers in actual use but ok, people will have to handle it themselves. I guess a simple solution might be to just highlight somewhere in the documentation that one can add a safe evaluate method like so:

julia> function safeevaluate(D::Union{Cosine, Overlap, MorisitaOverlap}, s1, s2)
           length(s1) >= D.q && length(s2) >= D.q && return(evaluate(D, s1, s2))
           throw(ArgumentError("An argument is shorter than q ($(D.q)): \"$s1\", \"$s2\""))
       end
safeevaluate (generic function with 1 method)

julia> D = Cosine(2)
Cosine(2)

julia> @assert safeevaluate(D, "aa", "bb") == evaluate(D, "aa", "bb")

julia> safeevaluate(D, "", "bb")
ERROR: ArgumentError: An argument is shorter than q (2): "", "bb"
Stacktrace:
 [1] safeevaluate(D::Cosine, s1::String, s2::String)
   @ Main ./REPL[2]:3
 [2] top-level scope
   @ REPL[5]:1

julia> function safeevaluate(D::Union{Jaccard, SorensenDice, NMD}, s1, s2)
           (length(s1) >= D.q || length(s2) >= D.q) && return(evaluate(D, s1, s2))
           throw(ArgumentError("An argument is shorter than q ($(D.q)): \"$s1\", \"$s2\""))
       end
safeevaluate (generic function with 2 methods)

julia> safeevaluate(Jaccard(2), "", "")
ERROR: ArgumentError: An argument is shorter than q (2): "", ""
Stacktrace:
 [1] safeevaluate(D::Jaccard, s1::String, s2::String)
   @ Main ./REPL[11]:3
 [2] top-level scope
   @ REPL[12]:1

@matthieugomez
Copy link
Owner

Sure. I'm also open to change things — following what other libraries typically do in this case. I will leave this issue open.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants