Describe the bug
The output of minhash_buckets cannot be processed by jaccard_map_buckets in the Docker image nvcr.io/nvidia/nemo:24.12-rc0.
Other image versions work fine; however, the older versions have an exact-dedup bug.
################ logs below
Fourth step: jaccard_map_buckets
cuDF Spilling is enabled
Num Workers = 4
Connected to dask cluster
Running jaccard map buckets script
Args = Namespace(device='gpu', files_per_partition=2, n_workers=96, num_files=None, nvlink_only=False, protocol='tcp', rmm_pool_size=None, scheduler_address=None, scheduler_file=None, threads_per_worker=1, input_data_dirs=['data/interim_output'], input_json_text_field='content', input_json_id_field='nemo_id', log_dir='./', profile_path=None, input_meta=None, output_dir='data/interim_output/dedupe', text_ddf_blocksize=256, input_bucket_dir='data/interim_output/buckets/_buckets.parquet', input_bucket_field='_bucket_id', shuffle_type='tasks', set_torch_to_use_rmm=False)
Number of files being read for jaccard calculation = 4
Number of ddf_bk partitions = 4
/usr/local/lib/python3.10/dist-packages/dask/dataframe/multi.py:521: UserWarning: Merging dataframes with merge column data type mismatches:
+------------------------------+------------+-------------+
| Merge columns | left dtype | right dtype |
+------------------------------+------------+-------------+
| ('_bucket_id', '_bucket_id') | object | uint64 |
+------------------------------+------------+-------------+
Cast dtypes explicitly to avoid unexpected results.
warnings.warn(
Traceback (most recent call last):
File "/usr/local/bin/jaccard_map_buckets", line 8, in
sys.exit(console_script())
File "/opt/NeMo-Curator/nemo_curator/scripts/fuzzy_deduplication/map_buckets.py", line 187, in console_script
main(attach_args().parse_args())
File "/opt/NeMo-Curator/nemo_curator/scripts/fuzzy_deduplication/map_buckets.py", line 169, in main
jaccard_get_output_map_workflow(
File "/opt/NeMo-Curator/nemo_curator/scripts/fuzzy_deduplication/map_buckets.py", line 138, in jaccard_get_output_map_workflow
ddf_anchor_docs_with_bk = get_anchor_and_output_map_info(
File "/opt/NeMo-Curator/nemo_curator/scripts/fuzzy_deduplication/map_buckets.py", line 68, in get_anchor_and_output_map_info
ddf_anchor_docs_with_bk = map_buckets.map_buckets_with_anchors(
File "/opt/NeMo-Curator/nemo_curator/modules/fuzzy_dedup.py", line 1067, in map_buckets_with_anchors
ddf_anchor_docs_with_bk = ddf_anchor_docs_with_bk.merge(
File "/usr/local/lib/python3.10/dist-packages/dask_expr/_collection.py", line 2937, in merge
return merge(
File "/usr/local/lib/python3.10/dist-packages/dask_expr/_collection.py", line 5711, in merge
result = new_collection(
File "/usr/local/lib/python3.10/dist-packages/dask_expr/_collection.py", line 4799, in new_collection
meta = expr._meta
File "/usr/lib/python3.10/functools.py", line 981, in get
val = self.func(instance)
File "/usr/local/lib/python3.10/dist-packages/dask_expr/_merge.py", line 196, in _meta
return make_meta(left.merge(right, **kwargs))
File "/usr/local/lib/python3.10/dist-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/dataframe.py", line 4299, in merge
).perform_merge()
File "/usr/local/lib/python3.10/dist-packages/cudf/core/join/join.py", line 265, in perform_merge
lcol_casted, rcol_casted = _match_join_keys(lcol, rcol, self.how)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/join/_join_helpers.py", line 118, in _match_join_keys
return lcol.astype(common_type), rcol.astype(common_type)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 1028, in astype
result = self.as_numerical_column(dtype)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/string.py", line 5732, in as_numerical_column
raise ValueError(
ValueError: Could not convert strings to float type due to presence of non-floating values.
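For reference, a minimal standalone sketch that reproduces the underlying cuDF failure outside of NeMo Curator. The column names and values below are invented for illustration; the point is only that one side of the merge holds _bucket_id as strings while the other holds uint64, as the warning above reports:

import cudf

# One side carries _bucket_id as strings (object dtype)...
left = cudf.DataFrame({"_bucket_id": ["0_1", "0_2"], "doc_id": [1, 2]})

# ...while the other side carries it as uint64.
right = cudf.DataFrame({
    "_bucket_id": cudf.Series([1, 2], dtype="uint64"),
    "anchor_id": [10, 20],
})

# cuDF looks for a common join-key type, tries to cast the string column
# to a numeric type, and raises:
# ValueError: Could not convert strings to float type due to presence of
# non-floating values.
left.merge(right, on="_bucket_id")

Casting explicitly, as the warning suggests, does not help here, because the string bucket ids are not numeric at all; this suggests the two pipeline steps are writing and expecting different bucket-id formats.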
Steps/Code to reproduce bug
Run the fuzzy-deduplication steps as described in the documentation.
Expected behavior
jaccard_map_buckets consumes the buckets written by minhash_buckets without error, as it does in other image versions.
Environment overview (please complete the following information)
Docker image nvcr.io/nvidia/nemo:24.12-rc0
Using a GPU machine
docker run --gpus all --ipc=host --ulimit stack=67108864

Thanks for raising @ms-leemina.
Following the changes made in #326, the LSH step now requires running with the --false-positive-check flag. I'll update some of the documentation to reflect this.
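In the meantime, assuming the bucket-generation step in question is the minhash_buckets console script named above, re-running it with the new flag appended should produce buckets that jaccard_map_buckets can consume. The placeholder below stands in for whatever options you already pass; only the flag is new:

minhash_buckets <your existing arguments> --false-positive-check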