Dataset v2.0 #461

Draft · wants to merge 53 commits into main
Conversation

@aliberts (Collaborator) commented on Oct 3, 2024

What this does

This PR introduces a new format for LeRobotDataset, which is accompanied by a new file structure. As these changes are not backward compatible, we increase CODEBASE_VERSION from v1.6 to v2.0.

What do I need to do?

If you already pushed a dataset using v1.6 of our codebase, you can use the conversion script lerobot/common/datasets/v2/batch_convert_dataset_v1_to_v2.py to convert it to the new format.
You will be asked to enter a prompt describing the task performed in the dataset.

Examples for single-task datasets:

python lerobot/common/datasets/v2/batch_convert_dataset_v1_to_v2.py \
    --repo-id lerobot/aloha_sim_insertion_human_image \
    --task "Insert the peg into the socket." \
    --robot-config lerobot/configs/robot/aloha.yaml
python lerobot/common/datasets/v2/batch_convert_dataset_v1_to_v2.py \
    --repo-id aliberts/koch_tutorial \
    --task "Pick the Lego block and drop it in the box on the right." \
    --robot-config lerobot/configs/robot/koch.yaml

For the more complicated cases of one task per episode or multiple tasks per episode, please refer to the documentation in that script.

Motivation

The current implementation of our LeRobotDataset suffers from a few shortcomings that make it inconvenient to use in some respects. Specifically:

  • The structure of the files does not accurately reflect the data structure. Our datasets are structured by episodes, which contrasts with typical ML scenarios with train/val/test splits (although these concepts can still be relevant here). This makes it hard to select a subset of episodes from a dataset, since the whole dataset has to be downloaded/loaded. Related: #440
  • The format is not transparent to the user: to get information about the content of a dataset, current options are limited to downloading the entire dataset and inspecting it with a custom script, or trying to visualize it using our visualization tool. Related: #383
  • The default file cache system used by datasets and huggingface_hub makes it inconvenient to create datasets locally (e.g. when recording). In order to use newly created files on disk, these libraries check whether those files are present in the cache (which they won't be) and, if not, download them even though they may already be on disk.
  • Some of the file formats used are too framework-specific for this format to be universal (e.g. .safetensors).
  • The dataset viewer on the hub is not compatible with our datasets due to VideoFrame not yet being integrated into datasets.
  • The current implementation lacks support for future features that we may want to add such as:
    • Task-tokens-conditioned training
    • Multirobot policies
    • Depth images (Related: #435)

Changes

Some of the biggest changes come from the new file structure and its content:

  .
  ├── data
- │   ├── train-00000-of-0001.parquet
+ │   ├── chunk-000
+ │   │   ├── episode_000000.parquet
+ │   │   ├── episode_000001.parquet
+ │   │   ├── episode_000002.parquet
+ │   │   └── ...
+ │   ├── chunk-001
+ │   │   ├── episode_001000.parquet
+ │   │   ├── episode_001001.parquet
+ │   │   ├── episode_001002.parquet
+ │   │   └── ...
+ │   └── ...
- ├── meta_data
+ ├── meta
- │   ├── episode_data_index.safetensors
- │   ├── stats.safetensors
+ │   ├── episodes.jsonl
  │   ├── info.json
+ │   ├── stats.json
+ │   └── tasks.jsonl
  └── videos
+     ├── chunk-000
+     │   ├── observation.images.laptop
      │   │   ├── episode_000000.mp4
      │   │   ├── episode_000001.mp4
      │   │   ├── episode_000002.mp4
      │   │   └── ...
+     │   ├── observation.images.phone
      │   │   ├── episode_000000.mp4
      │   │   ├── episode_000001.mp4
      │   │   ├── episode_000002.mp4
      │   │   └── ...
+     ├── chunk-001
      └── ...

Note that this file-based structure is designed to be as versatile as possible. The parquet files are split by episodes (this was already the case for videos), which allows much more granular control over which episodes one wants to use and download. The structure of the dataset is entirely described in the info.json file, which can easily be downloaded or viewed directly on the hub before downloading any actual data. The file types used are very simple and do not require complex tools to be read: only .parquet, .json, .jsonl and .mp4 files are used (plus .md for the README).
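Because info.json fully describes the dataset, it can be fetched on its own before pulling any parquet or video files. Below is a minimal sketch of that workflow using the standard huggingface_hub API; it is not code added by this PR, and the repo_id is simply one of the converted datasets mentioned later in this description.

import json

from huggingface_hub import hf_hub_download

# Fetch only the metadata file; no parquet or video files are downloaded.
info_path = hf_hub_download(
    repo_id="aliberts/aloha_sim_insertion_human",
    filename="meta/info.json",
    repo_type="dataset",
)
with open(info_path) as f:
    info = json.load(f)

print(info)  # keys, shapes, number of episodes, etc.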

Added

  • Various information about the structure of the dataset has been added and is now centralized in the info.json (keys, shapes, number of episodes, etc.). It should serve as a source of truth for what's inside the dataset.
  • episodes.jsonl contains per-episode information (episode_index, tasks in natural language and episode lengths); a sketch of reading it is shown below
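As an illustration only, episodes.jsonl can be read with the standard library. Apart from episode_index, the field names below ("tasks", "length") are assumptions based on the description above, not the final schema.

import json

# Illustrative only: "tasks" and "length" are assumed field names.
with open("meta/episodes.jsonl") as f:
    episodes = [json.loads(line) for line in f]

for ep in episodes[:3]:
    print(ep["episode_index"], ep.get("tasks"), ep.get("length"))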

Changed

  • stats.safetensors is now stats.json (the content remains the same but it's unflattened; see the sketch below)
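To make "unflattened" concrete, here is a hedged sketch: the "/" separator and the stat names are assumptions for illustration, not the exact on-disk schema.

# Hypothetical helper showing what unflattening the stats looks like.
def unflatten(flat: dict, sep: str = "/") -> dict:
    nested: dict = {}
    for key, value in flat.items():
        parts = key.split(sep)
        d = nested
        for part in parts[:-1]:
            d = d.setdefault(part, {})
        d[parts[-1]] = value
    return nested

flat_stats = {
    "observation.state/mean": [0.1, 0.2],
    "observation.state/std": [1.0, 1.0],
}
print(unflatten(flat_stats))
# {'observation.state': {'mean': [0.1, 0.2], 'std': [1.0, 1.0]}}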

Removed

  • episode_data_index.safetensors

Performance

In the nominal case (no delta_timestamps), LeRobotDataset.__getitem__() is twice as fast on average:

import time

import torch

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

REPO_ID = "lerobot/aloha_sim_insertion_human"
dataset = LeRobotDataset(repo_id=REPO_ID)

# Time 1000 single-frame reads.
durations = []
num_iterations = 1000
for i in range(num_iterations):
    start = time.perf_counter()
    item = dataset[i]
    duration = time.perf_counter() - start
    durations.append(duration)

avg_duration = torch.Tensor(durations).mean()
print(f"{num_iterations=}  {avg_duration=:.4f}s")
# v1.6
num_iterations=1000  avg_duration=0.0066s

# v2.0
num_iterations=1000  avg_duration=0.0032s

Using delta_timestamps, the gains are less significant but still there:

delta_timestamps = {
    "observation.images.top": [-1, -1/50, 0, 25/50],
    "observation.state": [-1, -1/50, 0, 25/50],
}
dataset = LeRobotDataset(repo_id=REPO_ID, delta_timestamps=delta_timestamps)
# v1.6
num_iterations=1000  avg_duration=0.0467s

# v2.0
num_iterations=1000  avg_duration=0.0303s

Fixes

  • Fix a bug in load_previous_and_future_frames which didn't actually raise an error when the requested timestamps from delta_timestamps did not correspond to actual timestamps in the dataset (a sketch of the kind of check now enforced is shown after this list).
  • Various fixes on the datasets have been made:
    • Some tasks already present in some datasets contained strings which were not part of the task (e.g. "tf.Tensor(b'Do something', shape=(), dtype=string)")
    • Some video files were not properly tracked by git lfs
    • Some datasets present a mismatch between the number of episodes in their parquet files and the number of video files. This is being investigated [TODO]:
      • lerobot/aloha_mobile_shrimp
      • lerobot/aloha_static_battery
      • lerobot/aloha_static_fork_pick_up
      • lerobot/aloha_static_thread_velcro
      • lerobot/uiuc_d3field
    • lerobot/viola is missing video keys [TODO]
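
For clarity, here is a hypothetical sketch of the kind of check the load_previous_and_future_frames fix enforces; the function name and tolerance handling are illustrative, not the actual implementation.

import torch

# Hypothetical: every requested timestamp must lie within a tolerance of an
# actual timestamp recorded in the episode, otherwise we raise.
def assert_timestamps_exist(requested: torch.Tensor, actual: torch.Tensor, tol: float) -> None:
    dist = torch.cdist(requested[:, None], actual[:, None], p=1)
    min_dist = dist.min(dim=1).values
    if (min_dist > tol).any():
        raise ValueError(
            "Some requested delta_timestamps do not match any timestamp in the dataset."
        )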

How it was tested

TODO

How to check out & try it (for the reviewer)

import time
import torch

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# On the main branch, using v1.6
REPO_ID = "lerobot/aloha_sim_insertion_human"  # try with '_image' as well

# On this branch, using v2.0
REPO_ID = "aliberts/aloha_sim_insertion_human"  # try with '_image' as well

delta_timestamps = {
    "observation.images.top": [-1, -1/50, 0, 25/50],
    "observation.state": [-1, -1/50, 0, 25/50],
}
dataset = LeRobotDataset(repo_id=REPO_ID, delta_timestamps=delta_timestamps)

durations = []
num_iterations = 1000
for i in range(num_iterations):
    start = time.perf_counter()
    item = dataset[i]
    duration = time.perf_counter() - start
    durations.append(duration)

avg_duration = torch.Tensor(durations).mean()
print(f"{num_iterations=}  {avg_duration=:.4f}s")

On this branch, you can also try out the new feature to select / download specific episodes:

dataset = LeRobotDataset(repo_id=REPO_ID, episodes=[1, 10, 12, 40])

@aliberts added the ✨ Enhancement and 🗃️ Dataset labels on Oct 3, 2024
@aliberts self-assigned this on Oct 3, 2024
@Cadene self-requested a review on Oct 11, 2024
Review thread on lerobot/common/datasets/lerobot_dataset.py (outdated, resolved):
'~/.cache/huggingface/lerobot'.
episodes (list[int] | None, optional): If specified, this will only load episodes specified by
their episode_index in this list. Defaults to None.
split (str, optional): _description_. Defaults to "train".
Collaborator commented:
I thought we were removing split?

@aliberts (Collaborator, Author) replied on Oct 11, 2024:
I've removed it in 8bd406e (it wasn't used anymore). I suggest we just keep a notion of split in the info.json, as I've done in the conversion script:

"splits": {
    "train": "0:50"
}
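
If we keep that convention, a "start:end" range could be read as a Python-style slice over episode indices. A hypothetical helper (not code from this PR) to make that concrete:

# Hypothetical: interpret a "start:end" split range from info.json as a list
# of episode indices.
def episodes_from_split(split_range: str) -> list[int]:
    start, end = (int(x) for x in split_range.split(":"))
    return list(range(start, end))

splits = {"train": "0:50"}
print(episodes_from_split(splits["train"]))  # [0, 1, ..., 49]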
