Remove duplicates in playerid_lookup fuzzy search #373
As mentioned in #358, if you were to search `playerid_lookup("tatis", "fernando", fuzzy=True)` right now, you would get duplicate rows for Fernando Tatís Jr and Sr. This happens because `fuzzy=True` and the search doesn't produce an exact match: the correct name is Tatís with the accented í, not Tatis. Since the Chadwick names for Tatís Jr and Sr are the same, 'Fernando Tatís' makes up 2 of the 5 names in `fuzzy_matches` when the merge is done with the player table in `get_closest_names()`. Each copy of the name matches the table data for both Tatís Jr and Sr, so we get duplicates for each.

The change I made was to drop the duplicate name before the merge (making the length of `fuzzy_matches` 4, not 5). The single remaining copy of the name still matches the data for both Jr and Sr, so we end up returning 5 players after the merge, as expected. The same effect can be seen with a fuzzy search for Vladimir Guerrero Jr and Sr, such as `playerid_lookup("guerrero", "vladimi", fuzzy=True)`.
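To illustrate why the duplicate name multiplies rows, here is a minimal pure-Python sketch of the join behaviour. It is not the pybaseball code: `inner_join` is a hypothetical stand-in for the merge against the player table, and the placeholder names and IDs are made up for the example.

```python
# Hypothetical stand-in for the player table: two distinct players share one name.
# All IDs and the "player a/b/c" names are made up for illustration.
player_table = [
    {"name": "fernando tatís", "player_id": 1},  # Sr.
    {"name": "fernando tatís", "player_id": 2},  # Jr.
    {"name": "player a", "player_id": 3},
    {"name": "player b", "player_id": 4},
    {"name": "player c", "player_id": 5},
]


def inner_join(names, table):
    """Return one table row per (name, row) pair whose names are equal,
    mimicking what an inner merge on the name column produces."""
    return [row for name in names for row in table if row["name"] == name]


# Five closest names, where the shared name appears twice.
fuzzy_matches = [
    "fernando tatís",
    "fernando tatís",
    "player a",
    "player b",
    "player c",
]

# Without deduplication, each copy of the name matches both Jr. and Sr.
before = inner_join(fuzzy_matches, player_table)

# Drop repeated names (order preserved) before joining.
deduped = list(dict.fromkeys(fuzzy_matches))
after = inner_join(deduped, player_table)

print(len(before), len(deduped), len(after))
```

In this sketch the undeduplicated join yields 7 rows (2 copies × 2 players, plus 3 others), while the deduplicated list of 4 names yields the expected 5 distinct players.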