
ASR Lhotse dataloader: TypeError: object of type 'IterableDatasetWrapper' has no len() #12093

Open
AudranBert opened this issue Feb 7, 2025 · 1 comment
AudranBert commented Feb 7, 2025

Describe the bug

I can't use tarred_audio_filepaths together with use_lhotse: true.

[NeMo I 2025-02-07 15:43:40 nemo_logging:393] Creating a Lhotse DynamicBucketingSampler (max_batch_duration=1100.0 max_batch_size=1)
Error executing job with overrides: []
Traceback (most recent call last):
  File "train.py", line 31, in main
    asr_model = setup_dataloaders(asr_model, cfg)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "train_utils.py", line 150, in setup_dataloaders
    asr_model.setup_training_data(cfg.model.train_ds)
  File "NeMo/nemo/collections/asr/models/ctc_models.py", line 419, in setup_training_data
    * ceil((len(self._train_dl.dataset) / self.world_size) / train_data_config['batch_size'])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: object of type 'IterableDatasetWrapper' has no len()
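
For context, the failure itself is generic PyTorch behavior: len() only works on map-style datasets that define __len__, while Lhotse dataloading goes through an iterable dataset. A minimal sketch (with a hypothetical stand-in class, not the actual NeMo/Lhotse wrapper) that reproduces the same TypeError:

    from torch.utils.data import IterableDataset

    class IterableDatasetWrapper(IterableDataset):  # hypothetical stand-in for Lhotse's wrapper
        def __iter__(self):
            yield from range(10)  # streams items; deliberately defines no __len__

    len(IterableDatasetWrapper())  # TypeError: object of type 'IterableDatasetWrapper' has no len()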

Steps/Code to reproduce bug

My setup is based on speech_to_text_finetune.py, and I use the following config:

init_from_nemo_model: ".cache/huggingface/hub/models--nvidia--parakeet-ctc-0.6b/snapshots/16ca39445465932bfbaeb5126933d5ce8bd43a77/parakeet-ctc-0.6b.nemo" # path to nemo model
model:
  sample_rate: 16000
  compute_eval_loss: true   # eval samples can be very long and exhaust memory. Disable computation of transducer loss during validation/testing with this flag.
  log_prediction: false     # enables logging sample predictions in the output during training
  ctc_reduction: 'mean_volume'
  skip_nan_grad: false
  seed: 42
  train_ds:
    sample_rate: ${model.sample_rate}
    batch_size: 1 # must be 1 if using bucketing_batch_size
    shuffle: true
    pin_memory: true
    max_duration: 30.1
    min_duration: 0.1
    shuffle_n: 2048
    num_workers: 8
    manifest_filepath:
    - - bucket1/sharded_manifests/manifest__OP_0..23_CL_.json
    - - bucket2/sharded_manifests/manifest__OP_0..23_CL_.json
    tarred_audio_filepaths:
    - - bucket1/audio__OP_0..23_CL_.tar
    - - bucket2/audio__OP_0..23_CL_.tar
    use_lhotse: true
    batch_duration: 1100
    quadratic_duration: 30
    num_buckets: 6
    num_cuts_for_bins_estimate: 10000
    bucket_buffer_size: 10000
    shuffle_buffer_size: 10000
    use_bucketing: true

    # tarred datasets
    is_tarred: false
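
Note: since use_lhotse: true makes the train dataloader dynamic/iterable, NeMo cannot derive the number of steps per epoch from len(dataset), which is exactly the call that fails above. If I read the Lhotse dataloading docs correctly (an assumption, not verified against v2.2.0rc2), the step budget then has to be given explicitly on the trainer side, roughly:

    trainer:
      max_steps: 100000   # example value; needed because the iterable dataset has no length
      max_epochs: -1      # epochs are not meaningful without a dataset length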

Environment overview (please complete the following information)

  • Installed with pip install nemo_toolkit['asr']: v2.2.0rc2

Environment details

  • PyTorch 2.5.1+cu121
  • Python 3.11

Additional context

GPU: A100

AudranBert added the bug (Something isn't working) label on Feb 7, 2025
AudranBert (Author) commented:

Hi @pzelasko, do you have any clue? I followed your recommendations from #10084, but I can't make it work. Is there an example of fine-tuning with Lhotse using tarred datasets?

I also tried:

    manifest_filepath: tarred_datasets/sharded_manifests/manifest__OP_0..95_CL_.json
    tarred_audio_filepaths: tarred_datasets/audio__OP_0..95_CL_.tar
    use_lhotse: true
    batch_duration: 1100
    quadratic_duration: 30
    num_buckets: 30
    num_cuts_for_bins_estimate: 10000
    bucket_buffer_size: 10000
    shuffle_buffer_size: 10000
    use_bucketing: true
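
For reference, the _OP_ and _CL_ markers in these paths stand in for { and }, so audio__OP_0..95_CL_.tar describes shards audio_0.tar through audio_95.tar. An illustrative Python sketch of the expansion (not NeMo's actual parser):

    import re

    pattern = "tarred_datasets/audio__OP_0..95_CL_.tar"
    m = re.search(r"_OP_(\d+)\.\.(\d+)_CL_", pattern)
    lo, hi = int(m.group(1)), int(m.group(2))
    shards = [pattern[:m.start()] + str(i) + pattern[m.end():] for i in range(lo, hi + 1)]
    # -> ['tarred_datasets/audio_0.tar', ..., 'tarred_datasets/audio_95.tar']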
