
Fuzzy dedup - minhash buckets and jaccard_map_buckets #430

Open
ms-leemina opened this issue Dec 13, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@ms-leemina

Describe the bug
The output of minhash_buckets cannot be processed by jaccard_map_buckets in the Docker image nvcr.io/nvidia/nemo:24.12-rc0.

Other image versions work fine; however, the older versions have an exact-dedup bug.

################ logs below
Fourth step: jaccard_map_buckets
cuDF Spilling is enabled
Num Workers = 4
Connected to dask cluster
Running jaccard map buckets script
Args = Namespace(device='gpu', files_per_partition=2, n_workers=96, num_files=None, nvlink_only=False, protocol='tcp', rmm_pool_size=None, scheduler_address=None, scheduler_file=None, threads_per_worker=1, input_data_dirs=['data/interim_output'], input_json_text_field='content', input_json_id_field='nemo_id', log_dir='./', profile_path=None, input_meta=None, output_dir='data/interim_output/dedupe', text_ddf_blocksize=256, input_bucket_dir='data/interim_output/buckets/_buckets.parquet', input_bucket_field='_bucket_id', shuffle_type='tasks', set_torch_to_use_rmm=False)
Number of files being read for jaccard calculation = 4
Number of ddf_bk partitions = 4
/usr/local/lib/python3.10/dist-packages/dask/dataframe/multi.py:521: UserWarning: Merging dataframes with merge column data type mismatches:
+------------------------------+------------+-------------+
| Merge columns | left dtype | right dtype |
+------------------------------+------------+-------------+
| ('_bucket_id', '_bucket_id') | object | uint64 |
+------------------------------+------------+-------------+
Cast dtypes explicitly to avoid unexpected results.
warnings.warn(
Traceback (most recent call last):
File "/usr/local/bin/jaccard_map_buckets", line 8, in
sys.exit(console_script())
File "/opt/NeMo-Curator/nemo_curator/scripts/fuzzy_deduplication/map_buckets.py", line 187, in console_script
main(attach_args().parse_args())
File "/opt/NeMo-Curator/nemo_curator/scripts/fuzzy_deduplication/map_buckets.py", line 169, in main
jaccard_get_output_map_workflow(
File "/opt/NeMo-Curator/nemo_curator/scripts/fuzzy_deduplication/map_buckets.py", line 138, in jaccard_get_output_map_workflow
ddf_anchor_docs_with_bk = get_anchor_and_output_map_info(
File "/opt/NeMo-Curator/nemo_curator/scripts/fuzzy_deduplication/map_buckets.py", line 68, in get_anchor_and_output_map_info
ddf_anchor_docs_with_bk = map_buckets.map_buckets_with_anchors(
File "/opt/NeMo-Curator/nemo_curator/modules/fuzzy_dedup.py", line 1067, in map_buckets_with_anchors
ddf_anchor_docs_with_bk = ddf_anchor_docs_with_bk.merge(
File "/usr/local/lib/python3.10/dist-packages/dask_expr/_collection.py", line 2937, in merge
return merge(
File "/usr/local/lib/python3.10/dist-packages/dask_expr/_collection.py", line 5711, in merge
result = new_collection(
File "/usr/local/lib/python3.10/dist-packages/dask_expr/_collection.py", line 4799, in new_collection
meta = expr._meta
File "/usr/lib/python3.10/functools.py", line 981, in get
val = self.func(instance)
File "/usr/local/lib/python3.10/dist-packages/dask_expr/_merge.py", line 196, in _meta
return make_meta(left.merge(right, **kwargs))
File "/usr/local/lib/python3.10/dist-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/dataframe.py", line 4299, in merge
).perform_merge()
File "/usr/local/lib/python3.10/dist-packages/cudf/core/join/join.py", line 265, in perform_merge
lcol_casted, rcol_casted = _match_join_keys(lcol, rcol, self.how)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/join/_join_helpers.py", line 118, in _match_join_keys
return lcol.astype(common_type), rcol.astype(common_type)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 1028, in astype
result = self.as_numerical_column(dtype)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/string.py", line 5732, in as_numerical_column
raise ValueError(
ValueError: Could not convert strings to float type due to presence of non-floating values.
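
The warning above points at the likely root cause: the _bucket_id column in the buckets file is a string (object) column, while the other side of the merge is uint64, so cudf tries to cast the strings to a numeric type during the join and fails. A minimal sketch of that failure mode, assuming only that cudf is installed (the bucket-id values below are hypothetical; what matters is the dtype mismatch):

    # Minimal sketch: merging a string (object) key against a uint64 key
    # in cudf. The values are made up; the dtype mismatch is the point.
    import cudf

    left = cudf.DataFrame({"_bucket_id": ["0_1", "0_2"], "doc": [0, 1]})
    right = cudf.DataFrame({"_bucket_id": cudf.Series([1, 2], dtype="uint64")})

    try:
        left.merge(right, on="_bucket_id")
    except ValueError as e:
        # "Could not convert strings to float type due to presence of
        # non-floating values." -- the same error as in the traceback.
        print(e)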

Steps/Code to reproduce bug

Run the steps from the documentation.

   # Step 2: gpu_compute_minhashes  
   echo "Second step: gpu_compute_minhashes"  
   gpu_compute_minhashes --input-data-dirs "${INTERIM_OUTPUT_DATA_DIR}" \
                        --output-minhash-dir "${OUTPUT_MINHASH_DIR}" \
                        --input-json-text-field "${INPUT_JSON_TEXT_FIELD}" \
                        --input-json-id-field "${INPUT_JSON_ID_FIELD}" \
                        --minhash-length "${MINHASH_LENGTH}" \
                        --char-ngram "${CHAR_NGRAM}" \
                        --hash-bytes "${HASH_BYTES}" \
                        --seed "${SEED}" \
                        --files-per-partition "${FILES_PER_PARTITION}" \
                        --log-dir "${LOG_DIR}"  
   
   echo "###################################################"  
   echo "Contents of minhash_output_dir"  
   find "${OUTPUT_MINHASH_DIR}" | awk '{print substr($0, 1, length($0)-length($NF)) "|-- " $NF}'
   echo "###################################################"  

   # Step 3: minhash_buckets  
   echo "Third step: minhash_buckets"  
   minhash_buckets --input-data-dirs "${OUTPUT_MINHASH_DIR}" \
                 --output-bucket-dir "${OUTPUT_BUCKET_DIR}" \
                 --input-minhash-field "${INPUT_MINHASH_FIELD}" \
                 --input-json-id-field "${INPUT_JSON_ID_FIELD}" \
                 --minhash-length "${MINHASH_LENGTH}" \
                 --num-bands "${NUM_BANDS}" \
                 --buckets-per-shuffle "${BUCKETS_PER_SHUFFLE}" \
                 --log-dir "${LOG_DIR}"  
   
   echo "###################################################"  
   echo "Contents of minhash_buckets_output_dir"  
   ls -l "${OUTPUT_BUCKET_DIR}"  
   echo "###################################################"  

   echo "False positive check: ${FALSE_POSITIVE_CHECK}"  
   if [ "${FALSE_POSITIVE_CHECK}" == "True" ]; then  
          # Step 4: jaccard_map_buckets  
          echo "Fourth step: jaccard_map_buckets"  
          jaccard_map_buckets --input-data-dirs "${INTERIM_OUTPUT_DATA_DIR}" \
                               --input-bucket-dir "${OUTPUT_BUCKET_DIR}/_buckets.parquet" \
                               --output-dir "${OUTPUT_DEDUPE_DIR}" \
                               --text-ddf-blocksize "${TEXT_DDF_BLOCKSIZE}" \
                               --input-json-text-field "${INPUT_JSON_TEXT_FIELD}" \
                               --input-json-id-field "${INPUT_JSON_ID_FIELD}" \
                               --log-dir "${LOG_DIR}"  

Expected behavior

Environment overview (please complete the following information)

Docker version of nvcr.io/nvidia/nemo:24.12-rc0
Using a GPU machine
docker run "--gpus", "all", "--ipc=host", "--ulimit", "stack=67108864"

ms-leemina added the bug label on Dec 13, 2024
ayushdg (Collaborator) commented Dec 18, 2024

Thanks for raising this, @ms-leemina.
Following the changes made in #326, the LSH step now requires running with the --false-positive-check flag. I'll update some of the documentation to reflect this.

   minhash_buckets --input-data-dirs "${OUTPUT_MINHASH_DIR}" \
                 --output-bucket-dir "${OUTPUT_BUCKET_DIR}" \
                 --input-minhash-field "${INPUT_MINHASH_FIELD}" \
                 --input-json-id-field "${INPUT_JSON_ID_FIELD}" \
                 --minhash-length "${MINHASH_LENGTH}" \
                 --num-bands "${NUM_BANDS}" \
                 --buckets-per-shuffle "${BUCKETS_PER_SHUFFLE}" \
                 --log-dir "${LOG_DIR}" \
                 --false-positive-check
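
A quick way to verify the fix is to inspect the dtype of the _bucket_id column in the written buckets file; with --false-positive-check it should be an integer type (uint64, matching the right dtype in the warning above) rather than object. A small sketch, assuming dask is available in the container (the path matches --input-bucket-dir from the logs):

    # Sanity check: read the LSH output and print the merge-key dtype.
    import dask.dataframe as dd

    ddf = dd.read_parquet("data/interim_output/buckets/_buckets.parquet")
    print(ddf["_bucket_id"].dtype)  # expect an integer dtype, not object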
