Skip to content

Conversation

@sabasiddique1
Copy link

@sabasiddique1 sabasiddique1 commented Jan 26, 2026

REF: #1847
Goal
Allow users to choose the distance metric used for nearest‑neighbor matching in fuzzy_join / Joiner, enabling alternatives like cosine or manhattan for different data types.

Context / Issue
Addresses enhancement request to support non‑Euclidean distance metrics for fuzzy join nearest neighbors.

What changed (step‑by‑step)

  1. Added a metric parameter to fuzzy_join and Joiner with default 'euclidean'.
  2. Passed the metric through the matching classes to sklearn.neighbors.NearestNeighbors.
  3. Documented the new parameter in docstrings with common metric examples.
  4. Added tests covering multiple metrics and invalid metric handling.

Tests

  • pytest skrub/tests/test_fuzzy_join.py -v

@MarieSacksick
Copy link
Contributor

hello @sabasiddique1!
Welcome :).

The CI doesn't pass, could you fix it please? It looks like it's because of the styling, it's explained how to deal with it here in the contribution guide.
In the mean time, you can put your PR in draft mode, so that it's clear that it's ongoing work. You can do it at the top right of the PR menu:
image

Comment on lines +541 to +543
# Test that invalid metric raises error
with pytest.raises(ValueError):
fuzzy_join(left, right, on="A", metric='invalid_metric')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in a stand-alone test, to keep tests unitary and dedicated to a specific purpose.


c = fuzzy_join(b, a, left_on="col3", right_on="col1", add_match_info=True)
assert ns.shape(c)[0] == len(b)
def test_fuzzy_join_distance_metrics(df_module):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of repeating the code, you can parameterize it, for instance as it's done here or here.
Also, we want to test the values and not only the shape, to make sure that the input "metric" is taken into account, and that the function doesn't always use the default one.

left, right, on="A", suffix="r", metric='euclidean', add_match_info=False
)
assert ns.shape(result_euclidean)[0] == 2
assert ns.shape(result_euclidean)[1] == 3 # A, Ar, Br
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment # A, Ar, Br is unclear to me, what do you mean here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants