
Fuzzy dedup - minhash buckets and jaccard_map_buckets #430

Open
ms-leemina opened this issue Dec 13, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@ms-leemina

Describe the bug
The output of minhash_buckets cannot be processed by jaccard_map_buckets in the Docker image nvcr.io/nvidia/nemo:24.12-rc0.

Other image versions work fine; however, the older versions have an exact-dedup bug.

################ logs below
Fourth step: jaccard_map_buckets
cuDF Spilling is enabled
Num Workers = 4
Connected to dask cluster
Running jaccard map buckets script
Args = Namespace(device='gpu', files_per_partition=2, n_workers=96, num_files=None, nvlink_only=False, protocol='tcp', rmm_pool_size=None, scheduler_address=None, scheduler_file=None, threads_per_worker=1, input_data_dirs=['data/interim_output'], input_json_text_field='content', input_json_id_field='nemo_id', log_dir='./', profile_path=None, input_meta=None, output_dir='data/interim_output/dedupe', text_ddf_blocksize=256, input_bucket_dir='data/interim_output/buckets/_buckets.parquet', input_bucket_field='_bucket_id', shuffle_type='tasks', set_torch_to_use_rmm=False)
Number of files being read for jaccard calculation = 4
Number of ddf_bk partitions = 4
/usr/local/lib/python3.10/dist-packages/dask/dataframe/multi.py:521: UserWarning: Merging dataframes with merge column data type mismatches:
+------------------------------+------------+-------------+
| Merge columns | left dtype | right dtype |
+------------------------------+------------+-------------+
| ('_bucket_id', '_bucket_id') | object | uint64 |
+------------------------------+------------+-------------+
Cast dtypes explicitly to avoid unexpected results.
warnings.warn(
Traceback (most recent call last):
File "/usr/local/bin/jaccard_map_buckets", line 8, in
sys.exit(console_script())
File "/opt/NeMo-Curator/nemo_curator/scripts/fuzzy_deduplication/map_buckets.py", line 187, in console_script
main(attach_args().parse_args())
File "/opt/NeMo-Curator/nemo_curator/scripts/fuzzy_deduplication/map_buckets.py", line 169, in main
jaccard_get_output_map_workflow(
File "/opt/NeMo-Curator/nemo_curator/scripts/fuzzy_deduplication/map_buckets.py", line 138, in jaccard_get_output_map_workflow
ddf_anchor_docs_with_bk = get_anchor_and_output_map_info(
File "/opt/NeMo-Curator/nemo_curator/scripts/fuzzy_deduplication/map_buckets.py", line 68, in get_anchor_and_output_map_info
ddf_anchor_docs_with_bk = map_buckets.map_buckets_with_anchors(
File "/opt/NeMo-Curator/nemo_curator/modules/fuzzy_dedup.py", line 1067, in map_buckets_with_anchors
ddf_anchor_docs_with_bk = ddf_anchor_docs_with_bk.merge(
File "/usr/local/lib/python3.10/dist-packages/dask_expr/_collection.py", line 2937, in merge
return merge(
File "/usr/local/lib/python3.10/dist-packages/dask_expr/_collection.py", line 5711, in merge
result = new_collection(
File "/usr/local/lib/python3.10/dist-packages/dask_expr/_collection.py", line 4799, in new_collection
meta = expr._meta
File "/usr/lib/python3.10/functools.py", line 981, in get
val = self.func(instance)
File "/usr/local/lib/python3.10/dist-packages/dask_expr/_merge.py", line 196, in _meta
return make_meta(left.merge(right, **kwargs))
File "/usr/local/lib/python3.10/dist-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/dataframe.py", line 4299, in merge
).perform_merge()
File "/usr/local/lib/python3.10/dist-packages/cudf/core/join/join.py", line 265, in perform_merge
lcol_casted, rcol_casted = _match_join_keys(lcol, rcol, self.how)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/join/_join_helpers.py", line 118, in _match_join_keys
return lcol.astype(common_type), rcol.astype(common_type)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 1028, in astype
result = self.as_numerical_column(dtype)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/string.py", line 5732, in as_numerical_column
raise ValueError(
ValueError: Could not convert strings to float type due to presence of non-floating values.
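
The warning above points at the likely root cause: the _bucket_id column in the buckets file is a string (object) column, while the other side of the merge is uint64, so cudf tries to cast the strings to a numeric type during the join and fails. A minimal sketch of that failure mode, assuming only that cudf is installed (the bucket-id values below are hypothetical; what matters is the dtype mismatch):

    # Minimal sketch: merging a string (object) key against a uint64 key
    # in cudf. The values are made up; the dtype mismatch is the point.
    import cudf

    left = cudf.DataFrame({"_bucket_id": ["0_1", "0_2"], "doc": [0, 1]})
    right = cudf.DataFrame({"_bucket_id": cudf.Series([1, 2], dtype="uint64")})

    try:
        left.merge(right, on="_bucket_id")
    except ValueError as e:
        # "Could not convert strings to float type due to presence of
        # non-floating values." -- the same error as in the traceback.
        print(e)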

Steps/Code to reproduce bug

Run the steps from the documentation.

   # Step 2: gpu_compute_minhashes  
   echo "Second step: gpu_compute_minhashes"  
   gpu_compute_minhashes --input-data-dirs "${INTERIM_OUTPUT_DATA_DIR}" \
                        --output-minhash-dir "${OUTPUT_MINHASH_DIR}" \
                        --input-json-text-field "${INPUT_JSON_TEXT_FIELD}" \
                        --input-json-id-field "${INPUT_JSON_ID_FIELD}" \
                        --minhash-length "${MINHASH_LENGTH}" \
                        --char-ngram "${CHAR_NGRAM}" \
                        --hash-bytes "${HASH_BYTES}" \
                        --seed "${SEED}" \
                        --files-per-partition "${FILES_PER_PARTITION}" \
                        --log-dir "${LOG_DIR}"  
   
   echo "###################################################"  
   echo "Contents of minhash_output_dir"  
   find "${OUTPUT_MINHASH_DIR}" | awk '{print substr($0, 1, length($0)-length($NF)) "|-- " $NF}'
   echo "###################################################"  

   # Step 3: minhash_buckets  
   echo "Third step: minhash_buckets"  
   minhash_buckets --input-data-dirs "${OUTPUT_MINHASH_DIR}" \
                 --output-bucket-dir "${OUTPUT_BUCKET_DIR}" \
                 --input-minhash-field "${INPUT_MINHASH_FIELD}" \
                 --input-json-id-field "${INPUT_JSON_ID_FIELD}" \
                 --minhash-length "${MINHASH_LENGTH}" \
                 --num-bands "${NUM_BANDS}" \
                 --buckets-per-shuffle "${BUCKETS_PER_SHUFFLE}" \
                 --log-dir "${LOG_DIR}"  
   
   echo "###################################################"  
   echo "Contents of minhash_buckets_output_dir"  
   ls -l "${OUTPUT_BUCKET_DIR}"  
   echo "###################################################"  

   echo "False positive check: ${FALSE_POSITIVE_CHECK}"  
   if [ "${FALSE_POSITIVE_CHECK}" == "True" ]; then  
          # Step 4: jaccard_map_buckets  
          echo "Fourth step: jaccard_map_buckets"  
          jaccard_map_buckets --input-data-dirs "${INTERIM_OUTPUT_DATA_DIR}" \
                               --input-bucket-dir "${OUTPUT_BUCKET_DIR}/_buckets.parquet" \
                               --output-dir "${OUTPUT_DEDUPE_DIR}" \
                               --text-ddf-blocksize "${TEXT_DDF_BLOCKSIZE}" \
                               --input-json-text-field "${INPUT_JSON_TEXT_FIELD}" \
                               --input-json-id-field "${INPUT_JSON_ID_FIELD}" \
                               --log-dir "${LOG_DIR}"  

Expected behavior

Environment overview (please complete the following information)

Docker version of nvcr.io/nvidia/nemo:24.12-rc0
Using a GPU machine
docker run "--gpus", "all", "--ipc=host", "--ulimit", "stack=67108864"

ms-leemina added the bug label on Dec 13, 2024
ayushdg (Collaborator) commented Dec 18, 2024

Thanks for raising this, @ms-leemina.
Following the changes made in #326, the LSH step now requires running with the --false-positive-check flag. I'll update some of the documentation to reflect this.

   minhash_buckets --input-data-dirs "${OUTPUT_MINHASH_DIR}" \
                 --output-bucket-dir "${OUTPUT_BUCKET_DIR}" \
                 --input-minhash-field "${INPUT_MINHASH_FIELD}" \
                 --input-json-id-field "${INPUT_JSON_ID_FIELD}" \
                 --minhash-length "${MINHASH_LENGTH}" \
                 --num-bands "${NUM_BANDS}" \
                 --buckets-per-shuffle "${BUCKETS_PER_SHUFFLE}" \
                 --log-dir "${LOG_DIR}" \
                 --false-positive-check
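
A quick way to verify the fix is to inspect the dtype of the _bucket_id column in the written buckets file; with --false-positive-check it should be an integer type (uint64, matching the right dtype in the warning above) rather than object. A small sketch, assuming dask is available in the container (the path matches --input-bucket-dir from the logs):

    # Sanity check: read the LSH output and print the merge-key dtype.
    import dask.dataframe as dd

    ddf = dd.read_parquet("data/interim_output/buckets/_buckets.parquet")
    print(ddf["_bucket_id"].dtype)  # expect an integer dtype, not object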
