Add configurable distance metric for fuzzy join nearest‑neighbor matching #1861
Hello @sabasiddique1! The CI doesn't pass; could you fix it, please? It looks like a styling failure, and the contribution guide explains how to deal with it.
    # Test that invalid metric raises error
    with pytest.raises(ValueError):
        fuzzy_join(left, right, on="A", metric='invalid_metric')
This should be in a stand-alone test, to keep tests unitary and dedicated to a specific purpose.
    c = fuzzy_join(b, a, left_on="col3", right_on="col1", add_match_info=True)
    assert ns.shape(c)[0] == len(b)

def test_fuzzy_join_distance_metrics(df_module):
        left, right, on="A", suffix="r", metric='euclidean', add_match_info=False
    )
    assert ns.shape(result_euclidean)[0] == 2
    assert ns.shape(result_euclidean)[1] == 3  # A, Ar, Br
The comment "# A, Ar, Br" is unclear to me. What do you mean here?

REF: #1847
Goal
Allow users to choose the distance metric used for nearest-neighbor matching in fuzzy_join/Joiner, enabling alternatives like cosine or manhattan for different data types.
Context / Issue
Addresses enhancement request to support non‑Euclidean distance metrics for fuzzy join nearest neighbors.
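To illustrate why the metric matters, here is a minimal, dependency-free sketch in which the same query gets a different nearest neighbor depending on the metric. It is a brute-force search in plain Python, not skrub's or scikit-learn's implementation; the helper names (`nearest`, `METRICS`) and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def euclidean(u, v):
    # Straight-line distance between two vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine(u, v):
    # Cosine distance: 1 - cosine similarity (ignores vector length).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical registry of supported metric names.
METRICS = {"euclidean": euclidean, "manhattan": manhattan, "cosine": cosine}


def nearest(query, candidates, metric="euclidean"):
    """Return the index of the candidate closest to `query` under `metric`."""
    if metric not in METRICS:
        raise ValueError(f"Unknown metric: {metric!r}")
    dist = METRICS[metric]
    return min(range(len(candidates)), key=lambda i: dist(query, candidates[i]))


candidates = [(10.0, 0.0), (1.0, 1.0)]
query = (2.0, 0.0)

print(nearest(query, candidates, "euclidean"))  # (1, 1) is closer in space
print(nearest(query, candidates, "cosine"))     # (10, 0) points the same way
```

With these toy vectors, Euclidean and Manhattan distance favor the candidate that is close in space, while cosine distance favors the candidate pointing in the same direction regardless of magnitude.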
What changed (step‑by‑step)
- Added a metric parameter to fuzzy_join and Joiner with default 'euclidean'.
- Passed metric through the matching classes to sklearn.neighbors.NearestNeighbors.
Tests
pytest skrub/tests/test_fuzzy_join.py -v
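For readers unfamiliar with the underlying estimator, the following is a rough sketch of how a metric name can be forwarded to sklearn.neighbors.NearestNeighbors. It assumes NumPy and scikit-learn are installed; the wrapper `nearest_rows` and the toy arrays are hypothetical and do not reproduce skrub's matching classes.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def nearest_rows(reference, queries, metric="euclidean"):
    """Hypothetical helper: index of and distance to the closest reference row."""
    # The metric name is handed to scikit-learn unchanged; scikit-learn
    # rejects unknown names with a ValueError.
    nn = NearestNeighbors(n_neighbors=1, metric=metric)
    nn.fit(reference)
    distances, indices = nn.kneighbors(queries)
    return indices.ravel(), distances.ravel()


reference = np.array([[10.0, 0.0], [1.0, 1.0]])
queries = np.array([[2.0, 0.0]])

for name in ("euclidean", "manhattan", "cosine"):
    idx, dist = nearest_rows(reference, queries, metric=name)
    print(name, idx, dist)
```

Relying on scikit-learn's own validation would be one way to get the ValueError that the invalid-metric test expects, though the PR may validate the argument elsewhere.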