Refactored static and dynamic enrichment APIs #336

Open · wants to merge 6 commits into main
Conversation

@mturk24 (Contributor) commented Oct 28, 2024

Refactored client-side code based on this task: https://www.notion.so/cleanlab/make-data-enrichment-client-side-API-match-backend-API-105c7fee85be8097b54dfb121b7dba4e

Goal: make the Dynamic API match the Static API in all facets: consistent naming of methods, argument types, regular-expression libraries, etc.

This way, users can use the backend API to prototype enrichment jobs and run them over a big static dataset, and then use the client API when they need to run the same logic in real time over streaming data, one example at a time.
For any packages we need to import client-side, make these lazy optional imports, so that the cleanlab-studio package still works without them installed (a rough sketch of this pattern is included below).
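
For illustration only (not code from this PR), the lazy optional import pattern could look something like the sketch below; some_optional_pkg and enrich_one_example are placeholder names, not real cleanlab-studio identifiers:

    from typing import Any


    def _get_some_optional_pkg() -> Any:
        """Import the optional client-side dependency only when it is actually needed."""
        try:
            import some_optional_pkg  # hypothetical optional dependency
        except ImportError as err:
            raise ImportError(
                "This feature requires `some_optional_pkg`; "
                "install it to use real-time enrichment."
            ) from err
        return some_optional_pkg


    def enrich_one_example(example: dict) -> dict:
        pkg = _get_some_optional_pkg()  # ImportError surfaces only on this code path
        # ... use pkg to enrich the example here ...
        return example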

This will contain:

  1. User code for when they just want to do some real-time data enrichment quickly (a hypothetical usage sketch follows this list).

  2. User code for when they want to first run a data enrichment project over a big static dataset, and then later run some real-time data enrichment over additional data.
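
As a purely hypothetical sketch of item 1, assuming the client-side run_online helper and EnrichmentOptions shown in the diff excerpts quoted later in this review (argument order, option fields, and whether a Studio object or an API key is passed are all still under discussion below):

    from cleanlab_studio import Studio
    from cleanlab_studio.utils.data_enrichment.enrich import run_online

    studio = Studio("<YOUR_API_KEY>")

    # Assumed EnrichmentOptions-style options; real field names may differ.
    options = {"prompt": "Is this product review positive? Answer Yes or No."}

    new_rows = [{"text": "Great product, would buy again."}]  # streaming data, one small batch
    job_info = run_online(new_rows, options, "is_positive", studio)  # argument order assumed
    print(job_info)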

Still doing some testing (unit-test runs), but requesting review on the overall structure.

@mturk24 requested a review from huiwengoh on October 28, 2024 at 20:50
Comment on lines 44 to 45
pd = _get_pandas()
tqdm = _get_tqdm()

Contributor:
Is there a reason these are not just being imported at the top of the file?
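
The top-of-file alternative being asked about would presumably just be:

    import pandas as pd
    from tqdm import tqdm

rather than routing through the cached getter calls quoted above.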

Returns:
Dict[str, Any]: A dictionary containing information about the enrichment job and the enriched dataset.
"""
run_online = _get_run_online()

Contributor:
Why is this not just imported at the top of the file?

Dict[str, Any]: A dictionary containing information about the enrichment job and the enriched dataset.
"""
run_online = _get_run_online()
job_info = run_online(data, options, new_column_name, self._api_key)

Contributor:
I don't think passing in self._api_key works here, because run_online expects a Studio object?


Args:
data (Union[pd.DataFrame, List[dict]]): The dataset to enrich.
options (EnrichmentOptions): Options for enriching the dataset.

Contributor:
Link to EnrichmentOptions docstring

return job_info


def _validate_enrichment_options(options: EnrichmentOptions) -> None:

Contributor:
Can you clarify why there is a separate _validate_enrichment_options defined here, rather than reusing the validation function in run()?

regex: Union[str, Replacement, List[Replacement]],
) -> Union[pd.Series, List[str]]:
column_data: Union["pd.Series", List[str]],
regex: Union[str, Tuple[str, str], List[Tuple[str, str]]],

Contributor:
[nit] Replacement is a type alias for the Tuple[str, str] type (ref here), not entirely sure why you made this change?
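
For reference, the alias in question is presumably defined along these lines:

    from typing import Tuple

    # Type alias for a single regex replacement, presumably a (pattern, replacement) pair.
    Replacement = Tuple[str, str]

so the old and new annotations describe the same type.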

Comment on lines 12 to 16
@lru_cache(maxsize=None)
def _get_pandas():
    import pandas as pd

    return pd

Member:
what is going on here?

pandas is already a dependency of this package, there should be no special logic to lazy-import it

"pandas==2.*",

Comment on lines 19 to 24
@lru_cache(maxsize=None)
def _get_tqdm():
    from tqdm import tqdm

    return tqdm


Member:
what is going on here?

tqdm is already a dependency of this package, there should be no special logic to lazy import it

"tqdm>=4.64.0",

Comment on lines +53 to +58
@lru_cache(maxsize=None)
def _get_run_online():
    from cleanlab_studio.utils.data_enrichment.enrich import run_online

    return run_online


@jwmueller (Member) commented Oct 31, 2024:
get rid of this and do a standard import, unsure why you are using such an odd approach
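
The standard import suggested here would presumably just be a top-of-module line such as:

    from cleanlab_studio.utils.data_enrichment.enrich import run_online

in place of the cached _get_run_online getter.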

@@ -17,6 +17,7 @@
import pandas as pd

Member:
please update the PR description with:

  1. User code if they just want to do some real-time data enrichment quickly.

  2. User code if they want to first run data enrichment project over a big static dataset, and then later want to run some real-time data enrichment over additional data.
