Image storage format #436
Comments
We compared PNG frames against compressed MP4 video on the PushT and Aloha environments in simulation, and we didn't notice a lower success rate with video. You can reproduce this result, since we currently support both image and video datasets. For instance:
As of now, we use parquet to store images: https://huggingface.co/datasets/lerobot/pusht_image/tree/main/data
Our current data format uses parquet to store the data on the hub, then arrow once it is downloaded into the cache (through HF datasets), and HF datasets loads the arrow data as PyTorch tensors. It's fast enough for now. We are still iterating on the format to make it simpler, faster, and more scalable.
I am quite interested in using `LeRobotDataset` for large-scale training, and I would like more context on the options for storing images so that I am aware of the implications they might have. Images can be stored as `.mp4` or `.pt`, but not in `arrow` or `parquet` format as many other HF datasets do. Is there any specific reason you didn't add support for `arrow`/`parquet`, which also provide memory mapping? Any ideas how PyTorch would compare to `arrow`/`parquet` when using datasets of 100s of millions of examples?
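For readers unfamiliar with why memory mapping matters at this scale, here is a minimal sketch using only the Python standard library. It is not how Arrow, parquet, or `LeRobotDataset` are implemented; the fixed-size record layout, the file name `frames.bin`, and the helpers `write_records`, `read_record`, and `demo` are all hypothetical and exist only to illustrate the idea: a record is read by offset from a mapped file without loading the whole file into memory.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: a 4-byte frame index plus 16 bytes of payload.
# Real image data would be far larger; the layout is an assumption for this sketch.
RECORD = struct.Struct("<I16s")


def write_records(path, n):
    """Write n fixed-size records to a flat binary file."""
    with open(path, "wb") as f:
        for i in range(n):
            payload = bytes([i % 256]) * 16
            f.write(RECORD.pack(i, payload))


def read_record(mm, i):
    """Read record i directly from the mapped file by computing its byte offset."""
    return RECORD.unpack_from(mm, i * RECORD.size)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "frames.bin")
        write_records(path, 1000)
        # Map the file read-only; the OS pages data in on demand,
        # so only the pages that are actually touched get read from disk.
        with open(path, "rb") as f, mmap.mmap(
            f.fileno(), 0, access=mmap.ACCESS_READ
        ) as mm:
            idx, payload = read_record(mm, 742)
            return idx, payload[:4], len(mm)


if __name__ == "__main__":
    print(demo())
```

The cost of a random read here depends on the record size rather than on the number of records in the file, which is the property that makes memory-mapped formats attractive for very large datasets.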