Open-X datasets #435

Open
nikonikolov opened this issue Sep 12, 2024 · 4 comments
Labels
🗃️ Dataset Something dataset-related

Comments

@nikonikolov

Thanks for the great work! I am interested in converting more of the open-x datasets to LeRobotDataset.

  • I was wondering if there was any particular reason the entire open-x wasn't added already, e.g. some difficulties you encountered with some specific datasets?
  • Do you have any tips on where I should be extra careful when converting from RLDS to LeRobotDataset, or is it generally as easy as calling the conversion script?
@aliberts
Collaborator

Hi there, thank you!

We did indeed have difficulties with some of them (which is why they're not uploaded yet), but don't worry, they're coming ;)

Among the issues, some of the datasets are massive (>1 TB) and the hub has some limitations in terms of storage and file system. We also faced some issues during video encoding. Right now we are focusing on refactoring LeRobotDataset, so we'll probably add the remaining ones after that.
cc @michel-aractingi

@aliberts aliberts added the 🗃️ Dataset Something dataset-related label Sep 24, 2024
@nikonikolov
Author

nikonikolov commented Sep 24, 2024

Hi, thanks for the reply. I actually did end up converting 90% of the openx datasets using a fork of lerobot. Sharing some findings which I think are important and you might find useful:

  • lerobot/common/datasets/push_dataset_to_hub/openx/transforms.py needs to be reworked. I haven't gotten through all of it yet, but some of these transforms perform randomization (e.g. for droid), which is not what you want for the raw dataset. I believe these transforms, used by OpenVLA and Octo, are meant for train time.
  • A lot of the raw data in the RLDS datasets is lost in the conversion
    • For example, fractal has many more fields inside observation and action which aren't propagated to the converted dataset. The same is true for many other datasets
    • Depth images don't get propagated either (haven't figured out yet how they are encoded in the RLDS datasets)
    • Episode metadata is completely lost too. For example, the kuka dataset uses the success field to filter episodes for training
  • My take is: convert RLDS to HF Dataset without any loss of data and leave the user to decide how to filter or what information to use from the dataset
  • The way I handled episode metadata: create a separate HF dataset, stored as a field of LeRobotDataset, which holds one row of metadata per episode
  • Image encoding: converting images to JPG can reduce the dataset size by about 10x, e.g. for fractal or language table. This avoids the complications with video encoding and directly uses the pyarrow format used by HF datasets. In my experience, converting PNG -> JPG doesn't lead to a degradation in performance
  • Multiprocessing and memory: during conversion, if you don't want to wait for days only to run out of memory, you need a multiprocessing implementation which periodically saves processed data to disk and concatenates everything into a single dataset at the end. Instead of torch state dicts, I used HF datasets for saving temporary 'shards', as torch state dicts don't seem to handle memory as well as HF datasets
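
To make the last point concrete, here is a rough sketch of the shard-then-concatenate pattern. It is not the code from the fork: JSON-lines files stand in for the temporary HF dataset shards, and `convert_episode`, `write_shard` and `convert_in_shards` are hypothetical names made up for this illustration.

```python
import json
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def convert_episode(episode: dict) -> dict:
    """Hypothetical per-episode conversion; here it only counts the steps."""
    return {"episode_index": episode["episode_index"], "num_steps": len(episode["steps"])}


def write_shard(job: tuple) -> str:
    """Convert one chunk of episodes and flush it to its own shard file."""
    shard_path, episodes = job
    with open(shard_path, "w") as f:
        for episode in episodes:
            f.write(json.dumps(convert_episode(episode)) + "\n")
    return shard_path


def convert_in_shards(episodes, out_dir, shard_size=2, num_workers=1):
    """Write shards of `shard_size` episodes, then concatenate them in order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # For brevity each job carries the episodes themselves. A real conversion
    # would more likely pass episode indices and let the worker read the raw data.
    jobs = [
        (str(out_dir / f"shard_{i // shard_size:05d}.jsonl"), episodes[i : i + shard_size])
        for i in range(0, len(episodes), shard_size)
    ]
    if num_workers > 1:
        with ProcessPoolExecutor(max_workers=num_workers) as pool:
            shard_paths = list(pool.map(write_shard, jobs))
    else:
        shard_paths = [write_shard(job) for job in jobs]

    # Final step: stream every shard into a single output file, so the full
    # converted dataset never has to sit in memory at once.
    merged = out_dir / "dataset.jsonl"
    with open(merged, "w") as dst:
        for path in shard_paths:
            with open(path) as src:
                for line in src:
                    dst.write(line)
    return merged
```

With `num_workers > 1`, the call to `convert_in_shards` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.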

Happy to share code or discuss further, hope this helps :)
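
As an illustration of the episode metadata point in the list above, a minimal sketch in plain Python (lists of dicts stand in for the two HF datasets). The `success` field is the one mentioned for kuka; `episode_index`, `frame_index`, the sample rows and the helper `select_steps` are assumptions made up for this example.

```python
# Step-level table: one row per frame. Values are made up for illustration.
steps = [
    {"episode_index": 0, "frame_index": 0},
    {"episode_index": 0, "frame_index": 1},
    {"episode_index": 1, "frame_index": 0},
    {"episode_index": 2, "frame_index": 0},
    {"episode_index": 2, "frame_index": 1},
]

# Separate episode-level table: one row per episode, nothing dropped.
episode_metadata = [
    {"episode_index": 0, "success": True},
    {"episode_index": 1, "success": False},
    {"episode_index": 2, "success": True},
]


def select_steps(steps, episode_metadata, predicate):
    """Keep the steps whose episode satisfies `predicate` on its metadata row."""
    keep = {row["episode_index"] for row in episode_metadata if predicate(row)}
    return [step for step in steps if step["episode_index"] in keep]


# The user, not the converter, decides to train on successful episodes only.
successful_steps = select_steps(steps, episode_metadata, lambda row: row["success"])
```

Keeping the filter on the user side means the converted dataset stays lossless, and a different filter only needs a different predicate.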

@Cadene
Collaborator

Cadene commented Sep 24, 2024

cc @michel-aractingi for visibility.

@nikonikolov
Author

nikonikolov commented Sep 25, 2024

Also, forgot to mention, some datasets have flipped image channels (BGR instead of RGB). Following https://github.com/kpertsch/rlds_dataset_mod/blob/main/prepare_open_x.sh:
berkeley_autolab_ur5: flip_wrist_image_channels
stanford_hydra_dataset_converted_externally_to_rlds: flip_wrist_image_channels,flip_image_channels
utaustin_mutex: flip_wrist_image_channels,flip_image_channels
berkeley_fanuc_manipulation: flip_wrist_image_channels,flip_image_channels
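
A minimal sketch of applying these flips with NumPy during conversion. The dataset names and flags come from the list above; the observation key names (`image`, `wrist_image`) are assumptions inferred from the flag names and should be checked against each dataset's schema, and `flip_channels` / `fix_observation` are hypothetical helpers.

```python
import numpy as np

# Dataset name -> observation keys whose channels are stored as BGR.
# Key names are assumed from the flag names, not verified per dataset.
BGR_KEYS = {
    "berkeley_autolab_ur5": ("wrist_image",),
    "stanford_hydra_dataset_converted_externally_to_rlds": ("wrist_image", "image"),
    "utaustin_mutex": ("wrist_image", "image"),
    "berkeley_fanuc_manipulation": ("wrist_image", "image"),
}


def flip_channels(image: np.ndarray) -> np.ndarray:
    """Reverse the last (channel) axis of an HxWx3 image: BGR <-> RGB."""
    return np.ascontiguousarray(image[..., ::-1])


def fix_observation(dataset_name: str, observation: dict) -> dict:
    """Return a shallow copy of `observation` with the listed image keys flipped."""
    fixed = dict(observation)
    for key in BGR_KEYS.get(dataset_name, ()):
        if key in fixed:
            fixed[key] = flip_channels(fixed[key])
    return fixed
```

Datasets that are not in the table pass through unchanged, so the same function can be called for every dataset in a conversion loop.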
