Is your feature request related to a problem? Please describe.
The LLaVA-like dataset implementation vlm.NevaLazyDataModule initializes the dataset from the JSON file with json.load() here.
The JSON file becomes huge for larger datasets: with 558K images and their captions it is 172 MB, and with 133K images and their captions it is 42 MB. This indicates that the JSON file grows linearly with the number of images (and their conversation captions).
We'd like to test with a 128M-image dataset; by this estimate, the JSON file would be roughly 40 GB, which will put heavy pressure on CPU RAM when the json.load() line executes.
This problem gets worse when the PyTorch DataLoader replicates the initialized Python list across worker processes, per this known PyTorch issue.
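For illustration, a minimal sketch of the pattern being described (class and file names are hypothetical, not the actual NeMo implementation):

```python
import json
from torch.utils.data import Dataset, DataLoader

class PreloadedCaptionDataset(Dataset):
    """Hypothetical sketch of the preloaded pattern described above."""

    def __init__(self, json_path):
        with open(json_path) as f:
            # The entire annotation file is materialized as one Python list.
            self.samples = json.load(f)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

# With num_workers > 0, every worker touches self.samples; CPython refcounting
# defeats copy-on-write, so each worker effectively ends up with its own copy
# of the list -- the known PyTorch issue mentioned above.
loader = DataLoader(PreloadedCaptionDataset("captions.json"), num_workers=8)
```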
Describe the solution you'd like
Has a different format for storing the conversation captions been considered? Storing the conversations alongside the images as individual files could help mitigate this issue.
If the JSON file is a must-have, could packages such as ijson be used to stream or lazily load it?
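For example, a rough sketch of streaming a top-level JSON array with ijson, so only one sample is held in memory at a time (the file name and handler are hypothetical):

```python
import ijson

# Stream each element of the top-level JSON array instead of json.load()-ing
# the whole multi-GB file; "item" is ijson's prefix for array elements.
with open("captions.json", "rb") as f:
    for sample in ijson.items(f, "item"):
        handle(sample)  # hypothetical per-sample handler
```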
Describe alternatives you've considered
I do notice a "jsonl" option here, though I haven't tried it out.
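If the jsonl route works, one way it could help is to index byte offsets once and re-read each line on demand; a rough sketch (not the NeMo implementation):

```python
import json
from torch.utils.data import Dataset

class JsonlCaptionDataset(Dataset):
    """Hypothetical sketch: keep only byte offsets in memory, seek per sample."""

    def __init__(self, jsonl_path):
        self.path = jsonl_path
        self.offsets = []
        with open(jsonl_path, "rb") as f:
            while True:
                offset = f.tell()
                if not f.readline():
                    break
                self.offsets.append(offset)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            return json.loads(f.readline())
```

Storing the offsets in a NumPy array rather than a Python list would also sidestep the copy-on-write replication across DataLoader workers, since NumPy arrays have no per-element refcounts.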
In the latest main we renamed this "lazy" dataset to the "preloaded" dataset; large datasets will be its limitation. But we also support the megatron-energon dataset, which is built on top of webdataset.
You can easily do the conversion following https://docs.nvidia.com/nemo-framework/user-guide/latest/vlms/energondataprep.html. It loads each sample on the fly instead of loading everything at once.
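For reference, a rough sketch of how WebDataset-style shards (which energon builds on) stream samples instead of preloading them; the shard pattern is hypothetical and this uses plain webdataset, not the NeMo/energon API:

```python
import webdataset as wds

# Samples are read tar-shard by tar-shard, so the full annotation list is
# never held in memory at once.
shards = "shards/captions-{000000..000127}.tar"   # hypothetical shard pattern
dataset = (
    wds.WebDataset(shards)
    .decode("pil")             # decode images to PIL
    .to_tuple("jpg", "json")   # (image, caption/conversation) pairs
)
for image, meta in dataset:
    ...
```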