llava-like dataset implementation "LazySupervisedDataset" likely fails to handle large dataset #12034

Open
bernardhan33 opened this issue Feb 3, 2025 · 1 comment
@bernardhan33
Is your feature request related to a problem? Please describe.

The llava-like dataset implementation vlm.NevaLazyDataModule, when given a JSON file, initializes the dataset with json.load() here.

The JSON file becomes huge for larger datasets. With 558K images and their captions, the JSON file is 172 MB; with 133K images and their captions, it is 42 MB. This indicates that the JSON file grows linearly with the number of images (and their conversation captions).

We'd like to test with a 128M-image dataset. By this calculation (~310 bytes per sample × 128M samples), the JSON file would be about 40 GB, which will put heavy stress on RAM when the json.load() line executes.

This problem gets worse when the PyTorch DataLoader worker processes effectively replicate the initialized Python list, per this known PyTorch copy-on-read issue.

Describe the solution you'd like
Are there plans to use a different format to store the conversation captions? Storing each conversation alongside its image as an individual file would help mitigate this issue.

If the JSON file is a must-have, are there plans to use packages such as ijson to stream or lazy-load the JSON file? A rough sketch of what that could look like is below.
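
A minimal sketch, assuming the LLaVA-style JSON is a single top-level array of sample dicts (the file name in the usage comment is hypothetical):

```python
# Minimal sketch: stream a LLaVA-style JSON file (a top-level array of
# sample dicts) with ijson instead of json.load(), so memory stays roughly
# constant regardless of file size. Requires `pip install ijson`.
import ijson

def iter_samples(json_path):
    with open(json_path, "rb") as f:
        # The "item" prefix selects each element of the top-level array.
        for sample in ijson.items(f, "item"):
            yield sample

# Example usage: count samples without ever holding the whole list in RAM.
# n = sum(1 for _ in iter_samples("captions.json"))  # hypothetical path
```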

Describe alternatives you've considered

I do notice a "jsonl" option here, though I haven't tried it out. A generic sketch of how a JSONL file could be read lazily is below.
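
For what it's worth, a generic sketch (not the NeMo implementation): index the byte offset of every line once, keep the offsets in a numpy array, and parse a single line per __getitem__. This also sidesteps the DataLoader copy-on-read issue above, since the per-sample metadata never lives in a big Python list of dicts.

```python
# Sketch only (not the NeMo implementation): lazily read samples from a
# JSONL captions file. Offsets are stored in a numpy array, which avoids the
# copy-on-read memory growth that a large Python list of dicts triggers in
# DataLoader worker processes.
import json
import numpy as np
from torch.utils.data import Dataset

class JsonlCaptionDataset(Dataset):
    def __init__(self, jsonl_path):
        self.path = jsonl_path
        offsets, pos = [], 0
        with open(jsonl_path, "rb") as f:
            for line in f:
                offsets.append(pos)
                pos += len(line)
        self.offsets = np.asarray(offsets, dtype=np.int64)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Seek to the idx-th line and parse only that sample.
        with open(self.path, "rb") as f:
            f.seek(int(self.offsets[idx]))
            return json.loads(f.readline())
```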

@yaoyu-33
Collaborator

In the latest main we renamed this lazy dataset to the "preloaded" dataset; large datasets will remain its limitation. But we do support the megatron-energon dataset as well, which is built on top of webdataset.
You can do the conversion by following https://docs.nvidia.com/nemo-framework/user-guide/latest/vlms/energondataprep.html. It loads each sample on the fly instead of loading everything at once.
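
For a rough idea of what such a conversion boils down to conceptually, here is a hypothetical sketch (not the energon data-prep tool from the linked doc) of repacking a LLaVA-style captions JSON plus its image folder into WebDataset tar shards, the storage layout megatron-energon builds on; the field names ("image", "conversations") and paths are assumptions:

```python
# Rough sketch (not the official energon data-prep tool): repack a
# LLaVA-style captions JSON and its image folder into WebDataset tar shards.
# Field names ("image", "conversations") and paths are assumptions.
import json
import os
import webdataset as wds

def repack(json_path, image_root, out_pattern="shards/shard-%06d.tar"):
    os.makedirs(os.path.dirname(out_pattern), exist_ok=True)
    with open(json_path) as f:
        samples = json.load(f)  # fine for small files; stream for huge ones
    with wds.ShardWriter(out_pattern, maxcount=10_000) as sink:
        for i, sample in enumerate(samples):
            with open(os.path.join(image_root, sample["image"]), "rb") as img:
                image_bytes = img.read()
            sink.write({
                "__key__": f"{i:09d}",
                "jpg": image_bytes,
                "json": json.dumps(sample["conversations"]).encode("utf-8"),
            })
```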
