New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Fuzzy Dedup: Make skipping the False positive check the default #386

Open

ayushdg wants to merge 4 commits into NVIDIA:main from ayushdg:enh-fuzzydedup-nofpcheck-default

Collaborator

ayushdg commented Nov 21, 2024 •

edited

Loading

Description

This PR updates the FuzzyDedup modules making false_positive_check=False and char_ngrams=24 the default params going forward.

Motivation:

Internal testing and external research has shown that depending on the values of num_buckets and hashes_per_bucket, the false positive rate can be minimized based on intended jaccard similarity.
The defaults of 20 buckets and 13 hashes per bucket has a low false positive rate for documents with similarity < 0.8 and internal testing has shown a difference of 1-2% when skipping the fp check.
Increasing the default minhash length to 24 which roughly corresponds to 5 word shingles, results in removal rates which are very similar to 5 char ngrams with the fp check enabled since 5 char ngrams are prone to more false positives.
The false positive check is computationally expensive and has significantly higher memory requirements. It uses requires significantly higher cache in memory keeping an additional copy of the corpus in addition to minhashes and buckets.

These new defaults should produce similar results to the previous defaults while being significantly less compute intensive and faster.

Will update tutorials in a followup PR.

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

ayushdg added 3 commits

November 20, 2024 18:14


          Update the default to skip fp check during fuzzy dedup

9e4eeff

Signed-off-by: Ayush Dattagupta <[email protected]>


          Update cli scripts

71bfcd5

Signed-off-by: Ayush Dattagupta <[email protected]>


          update tests

89baf93

Signed-off-by: Ayush Dattagupta <[email protected]>

ayushdg marked this pull request as ready for review

November 22, 2024 18:59


          Merge branch 'main' into enh-fuzzydedup-nofpcheck-default

c7fd294

ayushdg added enhancement gpuci labels

sarahyurick reviewed

View reviewed changes

Collaborator

sarahyurick left a comment

I think making the default false_positive_check=False makes sense. Added some nits, otherwise are there any other files that should be updated? Maybe:

?

nemo_curator/scripts/fuzzy_deduplication/README.md

-              > [!TIP]
-              > When using these scripts in a multi-node environment (like Slurm, K8's etc.) it is recommended to start up a Dask cluster prior to execution and connect to the existing cluster via the `--scheduler-address` or `--scheduler-file` flag.
-              > Use the `--help` flag to view all possible CLI options for the scripts and details on what they do.
+              > The up to date documentation on running the Fuzzy Deduplication scripts can be found in the [NeMo-User guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html#id4). It is recommended to use the Python API directly rather than the CLI scripts for most cases.

Collaborator

sarahyurick Nov 25, 2024

Suggested change

      
            > The up to date documentation on running the Fuzzy Deduplication scripts can be found in the [NeMo-User guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html#id4). It is recommended to use the Python API directly rather than the CLI scripts for most cases.
          
            > The up to date documentation on running the fuzzy deduplication scripts can be found in the [NeMo Curator User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html#id4). It is recommended to use the Python API directly rather than the CLI scripts for most cases.

nemo_curator/modules/config.py

                   num_buckets: int = 20
                   hashes_per_bucket: int = 13
                   use_64_bit_hash: bool = False
                   buckets_per_shuffle: int = 1
-                  false_positive_check: bool = True
+                  false_positive_check: bool = False
                   # Only required for fp check

Collaborator

sarahyurick Nov 25, 2024

Suggested change

      
                # Only required for fp check
          
                # Only required for false positive check

Nit, but I prefer spelling out false_positive each time. Going to be making a similar comment about "fpp" on #285 but I think it can get pretty difficult to modify our codebase otherwise.

nemo_curator/modules/config.py

-                          raise ValueError(
-                              "Finding fuzzy duplicates requires a cache directory accessible via all workers to store intermediates"
-                          )
+                      fp_defaults = {

Collaborator

sarahyurick Nov 25, 2024

Suggested change

      
                    fp_defaults = {
          
                    false_positive_defaults = {

nemo_curator/modules/config.py

-                          warnings.warn(
-                              "Using a higher number of anchor docs might lead to higher memory footprint and might impact performance",
-                              category=UserWarning,
+                          for arg, default in fp_defaults.items():

Collaborator

sarahyurick Nov 25, 2024

Suggested change

      
                        for arg, default in fp_defaults.items():
          
                        for arg, default in false_positive_defaults.items():

nemo_curator/modules/config.py

+                                  "Using a small char_ngrams value might lead to a large number (~5%) of false positives during deduplication."
+                                  " Using a value of at least 20 for char_ngrams is recommended."
+                              )
+                          unused_fp_args = [

Collaborator

sarahyurick Nov 25, 2024

Suggested change

      
                        unused_fp_args = [
          
                        unused_false_positive_args = [

nemo_curator/modules/config.py

+                                  " Using a value of at least 20 for char_ngrams is recommended."
+                              )
+                          unused_fp_args = [
+                              arg for arg in fp_defaults.keys() if getattr(self, arg) is not None

Collaborator

sarahyurick Nov 25, 2024

Suggested change

      
                            arg for arg in fp_defaults.keys() if getattr(self, arg) is not None
          
                            arg for arg in false_positive_defaults.keys() if getattr(self, arg) is not None

nemo_curator/modules/config.py

+                          unused_fp_args = [
+                              arg for arg in fp_defaults.keys() if getattr(self, arg) is not None
+                          ]
+                          if unused_fp_args:

Collaborator

sarahyurick Nov 25, 2024

Suggested change

      
                        if unused_fp_args:
          
                        if unused_false_positive_args:

nemo_curator/modules/config.py

+                          ]
+                          if unused_fp_args:
+                              warnings.warn(
+                                  f"False positive check is disabled. Unused arguments {unused_fp_args} will be ignored",

Collaborator

sarahyurick Nov 25, 2024

Suggested change

      
                                f"False positive check is disabled. Unused arguments {unused_fp_args} will be ignored",
          
                                f"False positive check is disabled. Unused arguments {unused_false_positive_args} will be ignored",

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement gpuci