Remove duplicates in playerid_lookup fuzzy search #373

Open
wants to merge 2 commits into
base: master


@mhmills mhmills commented Jul 27, 2023

As mentioned in #358, searching playerid_lookup("tatis", "fernando", fuzzy=True) currently returns duplicate rows for Fernando Tatís Jr and Sr. The search finds no exact match because the correct name is Tatís, with the accented í, not Tatis, so with fuzzy=True the lookup falls back to fuzzy matching. Since the Chadwick names for Tatís Jr and Sr are identical, 'Fernando Tatís' accounts for 2 of the 5 names in fuzzy_matches when it is merged with the player table in get_closest_names(). Each copy of the name matches the table rows for both Jr and Sr, so each player appears twice.

This change drops the duplicate name before the merge, so fuzzy_matches has 4 entries instead of 5. The single remaining copy of 'Fernando Tatís' still matches the rows for both Jr and Sr, so the merge returns 5 players as expected. The same duplication occurs in a fuzzy search for Vladimir Guerrero Jr and Sr, such as playerid_lookup("guerrero", "vladimi", fuzzy=True).
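The row counts described above can be reproduced with a small, self-contained sketch. It is not the pybaseball code: the player table, the IDs, the placeholder names, and the `inner_join` helper are all hypothetical, and the merge is imitated in plain Python as an inner join on the name.

```python
# Hypothetical player table: two players share one name, as Tatís Jr and Sr
# do in the Chadwick data. IDs and the other names are made up for illustration.
players = [
    {"name": "fernando tatís", "player_id": 1},  # stands in for Sr
    {"name": "fernando tatís", "player_id": 2},  # stands in for Jr
    {"name": "player a", "player_id": 3},
    {"name": "player b", "player_id": 4},
    {"name": "player c", "player_id": 5},
]


def inner_join(names, table):
    """Return one table row per (name, row) pair whose names are equal."""
    return [row for name in names for row in table if row["name"] == name]


# Assumed shape of the fuzzy result: 5 names, with the shared name listed twice.
fuzzy_matches = [
    "fernando tatís",
    "fernando tatís",
    "player a",
    "player b",
    "player c",
]

# Without deduplication, each copy of the shared name matches both players.
before = inner_join(fuzzy_matches, players)

# Drop duplicate names first, keeping the original order.
deduped = list(dict.fromkeys(fuzzy_matches))
after = inner_join(deduped, players)

print(len(fuzzy_matches), len(before))  # 5 names in, 7 rows out
print(len(deduped), len(after))         # 4 names in, 5 rows out
```

With the duplicate name left in, the two players who share it each appear twice (7 rows). After deduplication, the one remaining copy still matches both players, so all 5 players are returned exactly once.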
