While running Semantic Deduplication on text files, the semantic dedupe pipeline starts but then fails with IndexError: list index out of range
Error Log
GPU: 0, Part: 20: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4.57it/s]
2024-10-30 13:42:26,014 - distributed.utils_perf - WARNING - full garbage collections took 72% CPU time recently (threshold: 10%)
2024-10-30 13:42:26,060 - distributed.utils_perf - WARNING - full garbage collections took 65% CPU time recently (threshold: 10%)
2024-10-30 13:42:26,196 - distributed.worker - WARNING - Compute Failed
Key: ('read_single_partition-fused-toparquetdata-f053b8f0935a4edb94f161972e2f27a8', 2)
State: executing
Function: execute_task
args: ((<function Fused._execute_task at 0x7f2e10b56d40>, {'read_single_partition-fused-toparquetdata-f053b8f0935a4edb94f161972e2f27a8': ('toparquetdata-9d98c9d2f77890cf221ce0d97398b829', 2), ('toparquetdata-9d98c9d2f77890cf221ce0d97398b829', 2): (<dask.dataframe.io.parquet.core.ToParquetFunctionWrapper object at 0x7f2a9b3a6140>, ('reset_index-0ced2634f121de0dd6cc480baf637ca7', 2), (2,)), ('reset_index-0ced2634f121de0dd6cc480baf637ca7', 2): (<function apply at 0x7f2e663ec9d0>, <methodcaller: reset_index>, [('<crossfit.backend.torch.op.base.predictor object a-bd58448c0c3a7e2471c1d5ce629f4850', 2)], {'drop': True}), ('<crossfit.backend.torch.op.base.predictor object a-bd58448c0c3a7e2471c1d5ce629f4850', 2): (<function apply at 0x7f2e663ec9d0>, <function apply_and_enforce at 0x7f2e1139b910>, [('<crossfit.op.tokenize.tokenizer object at 0x7fdad0-ea35aded3a541ceda1ad391c99bb6e42', 2)], {'partition_info': {'number': 2, 'division': None}, '_func': <crossfit.backend.torch.op.base.Predictor object at
kwargs: {}
Exception: "IndexError('list index out of range')"
Traceback: ' File "/home/nemo_curator/lib/python3.10/site-packages/dask_expr/_expr.py", line 3758, in _execute_task\n return dask.core.get(graph, name)\n File "/home/nemo_curator/lib/python3.10/site-packages/dask/core.py", line 157, in get\n result = _execute_task(task, cache)\n File "/home/nemo_curator/lib/python3.10/site-packages/dask/core.py", line 127, in _execute_task\n return func(*(_execute_task(a, cache) for a in args))\n File "/home/nemo_curator/lib/python3.10/site-packages/dask/utils.py", line 78, in apply\n return func(*args, **kwargs)\n File "/home/nemo_curator/lib/python3.10/site-packages/dask/dataframe/core.py", line 7164, in apply_and_enforce\n df = func(*args, **kwargs)\n File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/base.py", line 96, in __call__\n output = self.call(data, *args, **kwargs)\n File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 155, in call\n input_ids, attention_mask = self.call_column(data[col])\n File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 120, in call_column\n tokenized_data = self.tokenize_strings(text).copy()\n File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 71, in tokenize_strings\n tokenized_data = tokenizer.batch_encode_plus(\n File "/home/nemo_curator/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3306, in batch_encode_plus\n return self._batch_encode_plus(\n File "/home/nemo_curator/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 562, in _batch_encode_plus\n for key in tokens_and_encodings[0][0].keys():\n'
2024-10-30 13:42:26,203 - distributed.utils_perf - WARNING - full garbage collections took 73% CPU time recently (threshold: 10%)
GPU: 0, Part: 19: 0%| | 0/1 [00:00<?, ?it/s]2024-10-30 13:42:26,302 - distributed.utils_perf - WARNING - full garbage collections took 65% CPU time recently (threshold: 10%)
Traceback (most recent call last):
  File "/home/projects/chem-data-curation/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 283, in <module>
    main()
  File "/home/projects/chem-data-curation/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 265, in main
    run_curation_pipeline(args, text_files, code_files)
  File "/home/projects/chem-data-curation/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 177, in run_curation_pipeline
    semantic_dataset_text = semantic_dedupe(dataset=gpu_dataset_text, sem_dedupe_config_yaml_path=sem_dedupe_config_yaml_path, type='text')
  File "/home/projects/chem-data-curation/NeMo-Curator/tutorials/dapt-curation/code/utils.py", line 354, in semantic_dedupe
    duplicates = semdup(dataset)
  File "/home/nemo_curator/lib/python3.10/site-packages/nemo_curator/modules/semantic_dedup.py", line 637, in __call__
    embeddings_dataset = self.embedding_creator(dataset)
  File "/home/nemo_curator/lib/python3.10/site-packages/nemo_curator/modules/semantic_dedup.py", line 215, in __call__
    write_to_disk(
  File "/home/nemo_curator/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 577, in write_to_disk
    df.to_parquet(output_file_dir, write_index=False)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask_expr/_collection.py", line 3281, in to_parquet
    return to_parquet(self, path, **kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask_expr/io/parquet.py", line 653, in to_parquet
    out = out.compute(**compute_kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask_expr/_collection.py", line 476, in compute
    return DaskMethodsMixin.compute(out, **kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask/base.py", line 376, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask/base.py", line 662, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask_expr/_expr.py", line 3758, in _execute_task
    return dask.core.get(graph, name)
  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/base.py", line 96, in __call__
    output = self.call(data, *args, **kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 155, in call
    input_ids, attention_mask = self.call_column(data[col])
  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 120, in call_column
    tokenized_data = self.tokenize_strings(text).copy()
  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 71, in tokenize_strings
    tokenized_data = tokenizer.batch_encode_plus(
  File "/home/nemo_curator/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3306, in batch_encode_plus
    return self._batch_encode_plus(
  File "/home/nemo_curator/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 562, in _batch_encode_plus
    for key in tokens_and_encodings[0][0].keys():
IndexError: list index out of range
Steps/Code to reproduce bug
Config for semantic dedupe
This was caused by an empty partition in the Dask DataFrame. Dropping the empty partitions before running semantic dedupe fixed it:
partition_lengths = ddf.map_partitions(len).compute()
non_empty_partitions = [i for i, length in enumerate(partition_lengths) if length > 0]
filtered_ddf = ddf.partitions[non_empty_partitions]
Long term, we should fix this in crossfit or NeMo Curator, or at least fail loudly.
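As a rough illustration of what "fail loudly" could look like, here is a minimal sketch in plain Python. The helper name non_empty_partition_indices and its fail_on_empty flag are hypothetical and are not part of crossfit or NeMo Curator; the sketch only assumes that the partition lengths have already been computed, for example with ddf.map_partitions(len).compute() as in the workaround.

```python
from typing import Iterable, List


def non_empty_partition_indices(
    partition_lengths: Iterable[int], fail_on_empty: bool = False
) -> List[int]:
    """Return the indices of partitions that contain at least one row.

    Hypothetical helper: with fail_on_empty=True it raises a ValueError
    that names the empty partitions instead of silently skipping them.
    """
    lengths = list(partition_lengths)  # materialize once so we can scan twice
    empty = [i for i, length in enumerate(lengths) if length == 0]
    if empty and fail_on_empty:
        raise ValueError(
            f"{len(empty)} empty partition(s) at indices {empty}; "
            "drop or repartition them before tokenization"
        )
    return [i for i, length in enumerate(lengths) if length > 0]


# Example lengths, standing in for the result of ddf.map_partitions(len).compute()
lengths = [3, 0, 5, 0]
print(non_empty_partition_indices(lengths))  # [0, 2]
```

The returned list can be passed to ddf.partitions[...] exactly as in the workaround. With fail_on_empty=True the same check surfaces a clear error before the tokenizer is reached, rather than an IndexError deep inside transformers.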
VibhuJawa changed the title from "semantic_dedupe runs into IndexError: list index out of range" to "Empty Partitions lead to into IndexError: list index out of range in semantic_dedup runs" on Jan 10, 2025