-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactored static and dynamic enrichment APIs #336
base: main
Are you sure you want to change the base?
Conversation
pd = _get_pandas() | ||
tqdm = _get_tqdm() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason these are not just being imported at the top of the file?
cleanlab_studio/studio/enrichment.py
Outdated
Returns: | ||
Dict[str, Any]: A dictionary containing information about the enrichment job and the enriched dataset. | ||
""" | ||
run_online = _get_run_online() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this not just imported at the top of the file?
cleanlab_studio/studio/enrichment.py
Outdated
Dict[str, Any]: A dictionary containing information about the enrichment job and the enriched dataset. | ||
""" | ||
run_online = _get_run_online() | ||
job_info = run_online(data, options, new_column_name, self._api_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think passing in self._api_key
because run_online
expects a Studio
object?
cleanlab_studio/studio/enrichment.py
Outdated
|
||
Args: | ||
data (Union[pd.DataFrame, List[dict]]): The dataset to enrich. | ||
options (EnrichmentOptions): Options for enriching the dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Link to EnrichmentOptions
docstring
return job_info | ||
|
||
|
||
def _validate_enrichment_options(options: EnrichmentOptions) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify why there is a separate _validate_enrichment_options
defined here rather than using the validation function in run()
here?
regex: Union[str, Replacement, List[Replacement]], | ||
) -> Union[pd.Series, List[str]]: | ||
column_data: Union["pd.Series", List[str]], | ||
regex: Union[str, Tuple[str, str], List[Tuple[str, str]]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] Replacement
is a type alias for the Tuple[str, str]
type (ref here), not entirely sure why you made this change?
@lru_cache(maxsize=None) | ||
def _get_pandas(): | ||
import pandas as pd | ||
|
||
return pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is going on here?
pandas is already a dependency of this package, there should be no special logic to lazy-import it
Line 52 in c2a3013
"pandas==2.*", |
@lru_cache(maxsize=None) | ||
def _get_tqdm(): | ||
from tqdm import tqdm | ||
|
||
return tqdm | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is going on here?
tqdm is already a dependency of this package, there should be no special logic to lazy import it
Line 57 in c2a3013
"tqdm>=4.64.0", |
@lru_cache(maxsize=None) | ||
def _get_run_online(): | ||
from cleanlab_studio.utils.data_enrichment.enrich import run_online | ||
|
||
return run_online | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get rid of this and do a standard import, unsure why you are using such an odd approach
@@ -17,6 +17,7 @@ | |||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please update the PR description with:
-
User code if they just want to do some real-time data enrichment quickly.
-
User code if they want to first run data enrichment project over a big static dataset, and then later want to run some real-time data enrichment over additional data.
Refactored client side code based on this task: https://www.notion.so/cleanlab/make-data-enrichment-client-side-API-match-backend-API-105c7fee85be8097b54dfb121b7dba4e
Goal: make Dynamic API match the Static API.
in all facets: cohesive naming of methods, argument types, regular expression libraries, etc.
This way user can use backend API to prototype Enrichment jobs and run them over a big static dataset, but then use client-API when they need to run the same logic in real-time over streaming data one at a time.
For any packages we need to import client-side, make these lazy optional imports, so cleanlab-studio package still works without those installed.
This will contain:
User code if they just want to do some real-time data enrichment quickly.
User code if they want to first run data enrichment project over a big static dataset, and then later want to run some real-time data enrichment over additional data.
Still doing some testing (unit test runs) but will request review on overall structure.