Dataset v2.0 #461
Draft: aliberts wants to merge 53 commits into main from user/aliberts/2024_09_25_reshape_dataset
aliberts added the ✨ Enhancement (New feature or request) and 🗃️ Dataset (Something dataset-related) labels on Oct 3, 2024
Cadene reviewed on Oct 11, 2024
    '~/.cache/huggingface/lerobot'.
    episodes (list[int] | None, optional): If specified, this will only load episodes specified by
    their episode_index in this list. Defaults to None.
    split (str, optional): _description_. Defaults to "train".
I thought we were removing split?
I've removed it in 8bd406e (it wasn't used anymore).
I suggest we simply keep a notion of split in info.json, as I've done in the conversion script:
    "splits": {
        "train": "0:50"
    }
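As a hedged illustration of how such a split entry could be consumed, the sketch below parses a "start:stop" string into episode indices. Treating "0:50" as a half-open range (episodes 0 through 49) is an assumption, and parse_split is a hypothetical helper name; neither is confirmed by this PR.

```python
def parse_split(spec: str) -> list[int]:
    """Turn a "start:stop" string into a list of episode indices.

    Assumption: the range is half-open, like Python slicing.
    """
    start_str, stop_str = spec.split(":")
    start, stop = int(start_str), int(stop_str)
    if start < 0 or stop < start:
        raise ValueError(f"invalid split range: {spec!r}")
    return list(range(start, stop))


# Hypothetical excerpt of info.json, mirroring the snippet above.
info = {"splits": {"train": "0:50"}}
train_episodes = parse_split(info["splits"]["train"])
print(len(train_episodes), train_episodes[0], train_episodes[-1])
```

Keeping the split as a plain string in info.json means it can be inspected on the hub without loading any data; the parsing rule itself would still need to be documented.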
What this does
This PR introduces a new format for LeRobotDataset, which is accompanied by a new file structure. As these changes are not backward compatible, we increase CODEBASE_VERSION from v1.6 to v2.0.
What do I need to do?
If you already pushed a dataset using v1.6 of our codebase, you can use the conversion script lerobot/common/datasets/v2/batch_convert_dataset_v1_to_v2.py to convert it to the new format. You will be asked to enter a prompt describing the task performed in the dataset.
Examples for single-task datasets:
    python lerobot/common/datasets/v2/batch_convert_dataset_v1_to_v2.py \
        --repo-id lerobot/aloha_sim_insertion_human_image \
        --task "Insert the peg into the socket." \
        --robot-config lerobot/configs/robot/aloha.yaml

    python lerobot/common/datasets/v2/batch_convert_dataset_v1_to_v2.py \
        --repo-id aliberts/koch_tutorial \
        --task "Pick the Lego block and drop it in the box on the right." \
        --robot-config lerobot/configs/robot/koch.yaml
For the more complicated cases of one task per episode or multiple tasks per episode, please refer to the documentation in that script.
Motivation
The current implementation of LeRobotDataset suffers from a few shortcomings that make it hard to use in some respects. Specifically:
- datasets and huggingface_hub make it inconvenient to create datasets locally (with recording). In order to use the newly created files on disk, these libraries check whether those files are present in the cache (which they won't be) and, if not, download them even though they may already be on disk.
- VideoFrame is not yet integrated into datasets.
Changes
Some of the biggest changes come from the new file structure and its contents.
Note that this file-based structure is designed to be as versatile as possible. The parquet files are split by episode (this was already the case for videos), which allows much more granular control over which episodes one wants to use and download. The structure of the dataset is entirely described in the info.json file, which can easily be downloaded or viewed directly on the hub before downloading any actual data. The file types used are simple and do not need complex tools to be read: only .parquet, .json, .jsonl and .mp4 files (.md for the README).
Added
- info.json describes the dataset (keys, shapes, number of episodes, etc.). It should serve as a source of truth for what's inside the dataset.
- episodes.jsonl contains per-episode information (episode_index, tasks in natural language and episode lengths).
Changed
- stats.safetensors is now stats.json (the content remains the same but it is unflattened).
Removed
- episode_data_index.safetensors
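With episode_data_index.safetensors gone, per-episode frame boundaries can presumably be recovered from the lengths stored in episodes.jsonl. The sketch below shows one way to do that using only the standard library. The field names (episode_index, tasks, length) follow the description above, but their exact spelling is an assumption, and episode_frame_ranges is a hypothetical helper, not part of the codebase.

```python
import json
from itertools import accumulate

# Hypothetical episodes.jsonl content: one JSON object per line.
EPISODES_JSONL = """\
{"episode_index": 0, "tasks": ["Insert the peg into the socket."], "length": 400}
{"episode_index": 1, "tasks": ["Insert the peg into the socket."], "length": 380}
{"episode_index": 2, "tasks": ["Insert the peg into the socket."], "length": 420}
"""


def episode_frame_ranges(jsonl_text: str) -> dict[int, tuple[int, int]]:
    """Map each episode_index to its half-open (from, to) frame range."""
    episodes = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    episodes.sort(key=lambda ep: ep["episode_index"])
    # Cumulative sums of the lengths give the end of each episode.
    ends = list(accumulate(ep["length"] for ep in episodes))
    starts = [0] + ends[:-1]
    return {ep["episode_index"]: (s, e) for ep, s, e in zip(episodes, starts, ends)}


ranges = episode_frame_ranges(EPISODES_JSONL)
print(ranges[1])  # (400, 780)
```

Because the boundaries are derived from the lengths, there is no separate index file to keep in sync with the per-episode metadata.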
Performance
In the nominal case (no delta_timestamps), LeRobotDataset.__getitem__() is twice as fast on average.
Using delta_timestamps, the gains are less significant but still there.
Fixes
- load_previous_and_future_frames, which didn't actually raise an error when the requested timestamps from delta_timestamps did not correspond to actual timestamps in the dataset.
"tf.Tensor(b'Do something', shape=(), dtype=string)")
lerobot/aloha_mobile_shrimp
lerobot/aloha_static_battery
lerobot/aloha_static_fork_pick_up
lerobot/aloha_static_thread_velcro
lerobot/uiuc_d3field
lerobot/viola is missing video keys [TODO]
How it was tested
TODO
How to checkout & try? (for the reviewer)
On this branch, you can also try out the new feature to select / download specific episodes:
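As a rough sketch of why per-episode files make this kind of selection cheap, the code below maps requested episode indices to the files that would be needed. The path templates and the helper name files_for_episodes are purely hypothetical; the actual layout is defined by this PR's file structure and info.json.

```python
from pathlib import PurePosixPath

# Hypothetical path templates; the real ones would come from the dataset's info.json.
DATA_TEMPLATE = "data/episode_{episode_index:06d}.parquet"
VIDEO_TEMPLATE = "videos/{video_key}_episode_{episode_index:06d}.mp4"


def files_for_episodes(episodes: list[int], video_keys: list[str]) -> list[PurePosixPath]:
    """List the per-episode files needed for the requested episodes only."""
    files = []
    for ep in sorted(set(episodes)):  # de-duplicate and keep a stable order
        files.append(PurePosixPath(DATA_TEMPLATE.format(episode_index=ep)))
        for key in video_keys:
            files.append(
                PurePosixPath(VIDEO_TEMPLATE.format(video_key=key, episode_index=ep))
            )
    return files


for path in files_for_episodes([0, 7], ["observation.images.top"]):
    print(path)
```

The episodes argument described in the reviewed docstring (a list of episode_index values) is the kind of input that would drive such a selection; everything else here is illustrative.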