Refactored static and dynamic enrichment APIs #336

Open · wants to merge 6 commits into main
Conversation

@mturk24 (Contributor) commented Oct 28, 2024

Refactored client-side code based on this task: https://www.notion.so/cleanlab/make-data-enrichment-client-side-API-match-backend-API-105c7fee85be8097b54dfb121b7dba4e

Goal: make the Dynamic API match the Static API in all facets: consistent naming of methods, argument types, regular-expression libraries, etc.

This way, users can use the backend API to prototype enrichment jobs and run them over a big static dataset, and then use the client API when they need to run the same logic in real time over streaming data, one example at a time.
For any packages we need to import client-side, make these lazy optional imports, so that the cleanlab-studio package still works without them installed (a rough sketch of this pattern is included below).
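
For illustration only (not code from this PR), the lazy optional import pattern could look something like the sketch below; some_optional_pkg and enrich_one_example are placeholder names, not real cleanlab-studio identifiers:

    from typing import Any


    def _get_some_optional_pkg() -> Any:
        """Import the optional client-side dependency only when it is actually needed."""
        try:
            import some_optional_pkg  # hypothetical optional dependency
        except ImportError as err:
            raise ImportError(
                "This feature requires `some_optional_pkg`; "
                "install it to use real-time enrichment."
            ) from err
        return some_optional_pkg


    def enrich_one_example(example: dict) -> dict:
        pkg = _get_some_optional_pkg()  # ImportError surfaces only on this code path
        # ... use pkg to enrich the example here ...
        return example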

This will contain:

  1. User code for when they just want to do some real-time data enrichment quickly (a hypothetical usage sketch follows this list).

  2. User code for when they want to first run a data enrichment project over a big static dataset, and then later run some real-time data enrichment over additional data.
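
As a purely hypothetical sketch of item 1, assuming the client-side run_online helper and EnrichmentOptions shown in the diff excerpts quoted later in this review (argument order, option fields, and whether a Studio object or an API key is passed are all still under discussion below):

    from cleanlab_studio import Studio
    from cleanlab_studio.utils.data_enrichment.enrich import run_online

    studio = Studio("<YOUR_API_KEY>")

    # Assumed EnrichmentOptions-style options; real field names may differ.
    options = {"prompt": "Is this product review positive? Answer Yes or No."}

    new_rows = [{"text": "Great product, would buy again."}]  # streaming data, one small batch
    job_info = run_online(new_rows, options, "is_positive", studio)  # argument order assumed
    print(job_info)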

Still doing some testing (unit-test runs), but requesting review on the overall structure.

@mturk24 requested a review from huiwengoh on October 28, 2024 at 20:50
Comment on lines 44 to 45
pd = _get_pandas()
tqdm = _get_tqdm()

Contributor:
Is there a reason these are not just being imported at the top of the file?
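
The top-of-file alternative being asked about would presumably just be:

    import pandas as pd
    from tqdm import tqdm

rather than routing through the cached getter calls quoted above.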

Returns:
Dict[str, Any]: A dictionary containing information about the enrichment job and the enriched dataset.
"""
run_online = _get_run_online()

Contributor:
Why is this not just imported at the top of the file?

Dict[str, Any]: A dictionary containing information about the enrichment job and the enriched dataset.
"""
run_online = _get_run_online()
job_info = run_online(data, options, new_column_name, self._api_key)

Contributor:
I don't think passing in self._api_key works here, because run_online expects a Studio object?


Args:
data (Union[pd.DataFrame, List[dict]]): The dataset to enrich.
options (EnrichmentOptions): Options for enriching the dataset.

Contributor:
Link to EnrichmentOptions docstring

return job_info


def _validate_enrichment_options(options: EnrichmentOptions) -> None:

Contributor:
Can you clarify why there is a separate _validate_enrichment_options defined here, rather than reusing the validation function in run()?

regex: Union[str, Replacement, List[Replacement]],
) -> Union[pd.Series, List[str]]:
column_data: Union["pd.Series", List[str]],
regex: Union[str, Tuple[str, str], List[Tuple[str, str]]],

Contributor:
[nit] Replacement is a type alias for the Tuple[str, str] type (ref here), not entirely sure why you made this change?
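
For reference, the alias in question is presumably defined along these lines:

    from typing import Tuple

    # Type alias for a single regex replacement, presumably a (pattern, replacement) pair.
    Replacement = Tuple[str, str]

so the old and new annotations describe the same type.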

Comment on lines 12 to 16
@lru_cache(maxsize=None)
def _get_pandas():
    import pandas as pd

    return pd

Member:
what is going on here?

pandas is already a dependency of this package, there should be no special logic to lazy-import it

"pandas==2.*",

Comment on lines 19 to 24
@lru_cache(maxsize=None)
def _get_tqdm():
    from tqdm import tqdm

    return tqdm


Member:
what is going on here?

tqdm is already a dependency of this package, there should be no special logic to lazy import it

"tqdm>=4.64.0",

Comment on lines +53 to +58
@lru_cache(maxsize=None)
def _get_run_online():
    from cleanlab_studio.utils.data_enrichment.enrich import run_online

    return run_online


@jwmueller (Member) commented Oct 31, 2024:
get rid of this and do a standard import, unsure why you are using such an odd approach
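
The standard import suggested here would presumably just be a top-of-module line such as:

    from cleanlab_studio.utils.data_enrichment.enrich import run_online

in place of the cached _get_run_online getter.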

@@ -17,6 +17,7 @@
import pandas as pd

Member:
please update the PR description with:

  1. User code if they just want to do some real-time data enrichment quickly.

  2. User code if they want to first run data enrichment project over a big static dataset, and then later want to run some real-time data enrichment over additional data.
