llava-like dataset implementation "LazySupervisedDataset" likely fails to handle large dataset #12034

Open
bernardhan33 opened this issue Feb 3, 2025 · 1 comment
@bernardhan33
Is your feature request related to a problem? Please describe.

The llava-like dataset implementation vlm.NevaLazyDataModule, when given a JSON file, initializes the dataset with json.load() here.

The JSON file becomes huge for larger datasets. With 558K images and their captions, the JSON file is 172 MB; with 133K images and their captions, it is 42 MB. This indicates that the JSON file grows linearly with the number of images (and their conversation captions).

We'd like to test with a 128M-image dataset. By this calculation (~310 bytes per sample × 128M samples), the JSON file would be about 40 GB, which will put heavy stress on RAM when the json.load() line executes.

This problem gets worse when the PyTorch DataLoader worker processes effectively replicate the initialized Python list, per this known PyTorch copy-on-read issue.

Describe the solution you'd like
Are there plans to use a different format to store the conversation captions? Storing each conversation alongside its image as an individual file would help mitigate this issue.

If the JSON file is a must-have, are there plans to use packages such as ijson to stream or lazy-load the JSON file? A rough sketch of what that could look like is below.
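
A minimal sketch, assuming the LLaVA-style JSON is a single top-level array of sample dicts (the file name in the usage comment is hypothetical):

```python
# Minimal sketch: stream a LLaVA-style JSON file (a top-level array of
# sample dicts) with ijson instead of json.load(), so memory stays roughly
# constant regardless of file size. Requires `pip install ijson`.
import ijson

def iter_samples(json_path):
    with open(json_path, "rb") as f:
        # The "item" prefix selects each element of the top-level array.
        for sample in ijson.items(f, "item"):
            yield sample

# Example usage: count samples without ever holding the whole list in RAM.
# n = sum(1 for _ in iter_samples("captions.json"))  # hypothetical path
```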

Describe alternatives you've considered

I do notice a "jsonl" option here, though I haven't tried it out. A generic sketch of how a JSONL file could be read lazily is below.
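
For what it's worth, a generic sketch (not the NeMo implementation): index the byte offset of every line once, keep the offsets in a numpy array, and parse a single line per __getitem__. This also sidesteps the DataLoader copy-on-read issue above, since the per-sample metadata never lives in a big Python list of dicts.

```python
# Sketch only (not the NeMo implementation): lazily read samples from a
# JSONL captions file. Offsets are stored in a numpy array, which avoids the
# copy-on-read memory growth that a large Python list of dicts triggers in
# DataLoader worker processes.
import json
import numpy as np
from torch.utils.data import Dataset

class JsonlCaptionDataset(Dataset):
    def __init__(self, jsonl_path):
        self.path = jsonl_path
        offsets, pos = [], 0
        with open(jsonl_path, "rb") as f:
            for line in f:
                offsets.append(pos)
                pos += len(line)
        self.offsets = np.asarray(offsets, dtype=np.int64)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Seek to the idx-th line and parse only that sample.
        with open(self.path, "rb") as f:
            f.seek(int(self.offsets[idx]))
            return json.loads(f.readline())
```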

@yaoyu-33
Collaborator

In the latest main we renamed this lazy dataset to the "preloaded" dataset; large datasets will remain its limitation. But we do support the megatron-energon dataset as well, which is built on top of webdataset.
You can do the conversion by following https://docs.nvidia.com/nemo-framework/user-guide/latest/vlms/energondataprep.html. It loads each sample on the fly instead of loading everything at once.
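
For a rough idea of what such a conversion boils down to conceptually, here is a hypothetical sketch (not the energon data-prep tool from the linked doc) of repacking a LLaVA-style captions JSON plus its image folder into WebDataset tar shards, the storage layout megatron-energon builds on; the field names ("image", "conversations") and paths are assumptions:

```python
# Rough sketch (not the official energon data-prep tool): repack a
# LLaVA-style captions JSON and its image folder into WebDataset tar shards.
# Field names ("image", "conversations") and paths are assumptions.
import json
import os
import webdataset as wds

def repack(json_path, image_root, out_pattern="shards/shard-%06d.tar"):
    os.makedirs(os.path.dirname(out_pattern), exist_ok=True)
    with open(json_path) as f:
        samples = json.load(f)  # fine for small files; stream for huge ones
    with wds.ShardWriter(out_pattern, maxcount=10_000) as sink:
        for i, sample in enumerate(samples):
            with open(os.path.join(image_root, sample["image"]), "rb") as img:
                image_bytes = img.read()
            sink.write({
                "__key__": f"{i:09d}",
                "jpg": image_bytes,
                "json": json.dumps(sample["conversations"]).encode("utf-8"),
            })
```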
