Remove duplicates in playerid_lookup fuzzy search #373

mhmills · 2023-07-27T18:58:26Z

As mentioned in #358, searching playerid_lookup("tatis", "fernando", fuzzy=True) currently returns duplicate rows for Fernando Tatís Jr and Sr. This happens because fuzzy=True is set and the search produces no exact match: the correct name is Tatís, with the accented í, not Tatis. Since the Chadwick names for Tatís Jr and Sr are identical, 'Fernando Tatís' accounts for 2 of the 5 names in fuzzy_matches when get_closest_names() merges them with the player table. Each copy of the name matches the table rows for both Tatís Jr and Sr, so each player is returned twice.

The change I made drops the duplicate name before the merge (so fuzzy_matches has 4 names instead of 5). The single remaining copy of the name still matches the data for both Jr and Sr, so we return 5 players after the merge, as expected. The same effect can be seen in a fuzzy search for Vladimir Guerrero Jr and Sr, such as playerid_lookup("guerrero", "vladimi", fuzzy=True).
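The row counts above can be reproduced with a small pandas sketch. This is not the pybaseball code itself: the toy `players` table, the placeholder names, the made-up IDs, and the column names are all assumptions for illustration. Only the idea of dropping duplicate names before the merge comes from this PR.

```python
import pandas as pd

# Toy stand-in for the player table (hypothetical names and IDs).
# Two different players share the exact same name, like Tatís Jr and Sr.
players = pd.DataFrame({
    "name": ["fernando tatís", "fernando tatís",
             "player a", "player b", "player c"],
    "player_id": [1, 2, 3, 4, 5],
})

# Hypothetical fuzzy-match result: 5 names, with the shared name listed twice.
fuzzy_matches = pd.DataFrame({
    "name": ["fernando tatís", "fernando tatís",
             "player a", "player b", "player c"],
})

# Merging as-is: each of the 2 copies matches both players (2 * 2 = 4 rows),
# plus 3 rows for the other names.
with_dups = fuzzy_matches.merge(players, on="name")

# Dropping the duplicate name first leaves 4 names; the single copy still
# matches both players, so the merge yields one row per player.
deduped = fuzzy_matches.drop_duplicates().merge(players, on="name")

print(len(with_dups), len(deduped))
```

In this sketch the de-duplicated merge keeps both players who share the name, since the de-duplication is applied to the list of names rather than to the merged result.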

mhmills added 2 commits July 26, 2023 20:52

Remove Duplicates in playerid_lookup fuzzy

e602fbb

Merge branch 'master' of https://github.com/jldbc/pybaseball

617f65c
