Dataset v2.0 #461

Draft · wants to merge 53 commits into main
Conversation

@aliberts (Collaborator) commented on Oct 3, 2024

What this does

This PR introduces a new format for LeRobotDataset, which is accompanied by a new file structure. As these changes are not backward compatible, we increase CODEBASE_VERSION from v1.6 to v2.0.

What do I need to do?

If you already pushed a dataset using v1.6 of our codebase, you can use the conversion script lerobot/common/datasets/v2/batch_convert_dataset_v1_to_v2.py to convert it to the new format.
You will be asked to enter a prompt describing the task performed in the dataset.

Examples for single-task datasets:

python lerobot/common/datasets/v2/batch_convert_dataset_v1_to_v2.py \
    --repo-id lerobot/aloha_sim_insertion_human_image \
    --task "Insert the peg into the socket." \
    --robot-config lerobot/configs/robot/aloha.yaml
python lerobot/common/datasets/v2/batch_convert_dataset_v1_to_v2.py \
    --repo-id aliberts/koch_tutorial \
    --task "Pick the Lego block and drop it in the box on the right." \
    --robot-config lerobot/configs/robot/koch.yaml

For the more complicated cases of one task per episode or multiple tasks per episode, please refer to the documentation in that script.

Motivation

The current implementation of our LeRobotDataset suffers from a few shortcomings that make it inconvenient to use in some respects. Specifically:

  • The structure of the files does not accurately reflect the data structure. Our datasets are structured by episodes, which contrasts with typical ML scenarios with train/val/test splits (although these concepts can still be relevant here). This makes it hard to select a subset of episodes from a dataset, since the whole dataset has to be downloaded/loaded. Related: #440
  • The format is not transparent to the user: to get information about the content of a dataset, current options are limited to downloading the entire dataset and inspecting it with a custom script, or trying to visualize it using our visualization tool. Related: #383
  • The default file cache system used by datasets and huggingface_hub makes it inconvenient to create datasets locally (e.g. when recording). In order to use newly created files on disk, these libraries check whether those files are present in the cache (which they won't be) and, if not, download them even though they may already be on disk.
  • Some of the file formats used are too framework-specific for this format to be universal (e.g. .safetensors).
  • The dataset viewer on the hub is not compatible with our datasets due to VideoFrame not yet being integrated into datasets.
  • The current implementation lacks support for future features that we may want to add such as:
    • Task-tokens-conditioned training
    • Multirobot policies
    • Depth images (Related: #435)

Changes

Some of the biggest changes come from the new file structure and its content:

  .
  ├── data
- │   ├── train-00000-of-0001.parquet
+ │   ├── chunk-000
+ │   │   ├── episode_000000.parquet
+ │   │   ├── episode_000001.parquet
+ │   │   ├── episode_000002.parquet
+ │   │   └── ...
+ │   ├── chunk-001
+ │   │   ├── episode_001000.parquet
+ │   │   ├── episode_001001.parquet
+ │   │   ├── episode_001002.parquet
+ │   │   └── ...
+ │   └── ...
- ├── meta_data
+ ├── meta
- │   ├── episode_data_index.safetensors
- │   ├── stats.safetensors
+ │   ├── episodes.jsonl
  │   ├── info.json
+ │   ├── stats.json
+ │   └── tasks.jsonl
  └── videos
+     ├── chunk-000
+     │   ├── observation.images.laptop
      │   │   ├── episode_000000.mp4
      │   │   ├── episode_000001.mp4
      │   │   ├── episode_000002.mp4
      │   │   └── ...
+     │   ├── observation.images.phone
      │   │   ├── episode_000000.mp4
      │   │   ├── episode_000001.mp4
      │   │   ├── episode_000002.mp4
      │   │   └── ...
+     ├── chunk-001
      └── ...

Note that this file-based structure is designed to be as versatile as possible. The parquet files are split by episodes (this was already the case for videos), which allows much more granular control over which episodes one wants to use and download. The structure of the dataset is entirely described in the info.json file, which can easily be downloaded or viewed directly on the hub before downloading any actual data. The file types used are very simple and do not require complex tools to be read: only .parquet, .json, .jsonl and .mp4 files are used (plus .md for the README).
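Because info.json fully describes the dataset, it can be fetched on its own before pulling any parquet or video files. Below is a minimal sketch of that workflow using the standard huggingface_hub API; it is not code added by this PR, and the repo_id is simply one of the converted datasets mentioned later in this description.

import json

from huggingface_hub import hf_hub_download

# Fetch only the metadata file; no parquet or video files are downloaded.
info_path = hf_hub_download(
    repo_id="aliberts/aloha_sim_insertion_human",
    filename="meta/info.json",
    repo_type="dataset",
)
with open(info_path) as f:
    info = json.load(f)

print(info)  # keys, shapes, number of episodes, etc.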

Added

  • Various information about the structure of the dataset has been added and is now centralized in the info.json (keys, shapes, number of episodes, etc.). It should serve as a source of truth for what's inside the dataset.
  • episodes.jsonl contains per-episode information (episode_index, tasks in natural language and episode lengths); a sketch of reading it is shown below
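As an illustration only, episodes.jsonl can be read with the standard library. Apart from episode_index, the field names below ("tasks", "length") are assumptions based on the description above, not the final schema.

import json

# Illustrative only: "tasks" and "length" are assumed field names.
with open("meta/episodes.jsonl") as f:
    episodes = [json.loads(line) for line in f]

for ep in episodes[:3]:
    print(ep["episode_index"], ep.get("tasks"), ep.get("length"))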

Changed

  • stats.safetensors is now stats.json (the content remains the same but it's unflattened; see the sketch below)
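To make "unflattened" concrete, here is a hedged sketch: the "/" separator and the stat names are assumptions for illustration, not the exact on-disk schema.

# Hypothetical helper showing what unflattening the stats looks like.
def unflatten(flat: dict, sep: str = "/") -> dict:
    nested: dict = {}
    for key, value in flat.items():
        parts = key.split(sep)
        d = nested
        for part in parts[:-1]:
            d = d.setdefault(part, {})
        d[parts[-1]] = value
    return nested

flat_stats = {
    "observation.state/mean": [0.1, 0.2],
    "observation.state/std": [1.0, 1.0],
}
print(unflatten(flat_stats))
# {'observation.state': {'mean': [0.1, 0.2], 'std': [1.0, 1.0]}}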

Removed

  • episode_data_index.safetensors

Performance

In the nominal case (no delta_timestamps), LeRobotDataset.__getitem__() is twice as fast on average:

import time

import torch

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

REPO_ID = "lerobot/aloha_sim_insertion_human"
dataset = LeRobotDataset(repo_id=REPO_ID)

# Time 1000 single-frame reads.
durations = []
num_iterations = 1000
for i in range(num_iterations):
    start = time.perf_counter()
    item = dataset[i]
    duration = time.perf_counter() - start
    durations.append(duration)

avg_duration = torch.Tensor(durations).mean()
print(f"{num_iterations=}  {avg_duration=:.4f}s")
# v1.6
num_iterations=1000  avg_duration=0.0066s

# v2.0
num_iterations=1000  avg_duration=0.0032s

Using delta_timestamps, the gains are less significant but still there:

delta_timestamps = {
    "observation.images.top": [-1, -1/50, 0, 25/50],
    "observation.state": [-1, -1/50, 0, 25/50],
}
dataset = LeRobotDataset(repo_id=REPO_ID, delta_timestamps=delta_timestamps)
# v1.6
num_iterations=1000  avg_duration=0.0467s

# v2.0
num_iterations=1000  avg_duration=0.0303s

Fixes

  • Fix a bug in load_previous_and_future_frames which didn't actually raise an error when the requested timestamps from delta_timestamps did not correspond to actual timestamps in the dataset (a sketch of the kind of check now enforced is shown after this list).
  • Various fixes on the datasets have been made:
    • Some tasks already present in some datasets contained strings which were not part of the task (e.g. "tf.Tensor(b'Do something', shape=(), dtype=string)")
    • Some video files were not properly tracked by git lfs
    • Some datasets present a mismatch between the number of episodes in their parquet files and the number of video files. This is being investigated [TODO]:
      • lerobot/aloha_mobile_shrimp
      • lerobot/aloha_static_battery
      • lerobot/aloha_static_fork_pick_up
      • lerobot/aloha_static_thread_velcro
      • lerobot/uiuc_d3field
    • lerobot/viola is missing video keys [TODO]
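
For clarity, here is a hypothetical sketch of the kind of check the load_previous_and_future_frames fix enforces; the function name and tolerance handling are illustrative, not the actual implementation.

import torch

# Hypothetical: every requested timestamp must lie within a tolerance of an
# actual timestamp recorded in the episode, otherwise we raise.
def assert_timestamps_exist(requested: torch.Tensor, actual: torch.Tensor, tol: float) -> None:
    dist = torch.cdist(requested[:, None], actual[:, None], p=1)
    min_dist = dist.min(dim=1).values
    if (min_dist > tol).any():
        raise ValueError(
            "Some requested delta_timestamps do not match any timestamp in the dataset."
        )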

How it was tested

TODO

How to check out & try it (for the reviewer)

import time
import torch

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# On the main branch, using v1.6
REPO_ID = "lerobot/aloha_sim_insertion_human"  # try with '_image' as well

# On this branch, using v2.0
REPO_ID = "aliberts/aloha_sim_insertion_human"  # try with '_image' as well

delta_timestamps = {
    "observation.images.top": [-1, -1/50, 0, 25/50],
    "observation.state": [-1, -1/50, 0, 25/50],
}
dataset = LeRobotDataset(repo_id=REPO_ID, delta_timestamps=delta_timestamps)

durations = []
num_iterations = 1000
for i in range(num_iterations):
    start = time.perf_counter()
    item = dataset[i]
    duration = time.perf_counter() - start
    durations.append(duration)

avg_duration = torch.Tensor(durations).mean()
print(f"{num_iterations=}  {avg_duration=:.4f}s")

On this branch, you can also try out the new feature to select / download specific episodes:

dataset = LeRobotDataset(repo_id=REPO_ID, episodes=[1, 10, 12, 40])

@aliberts added the ✨ Enhancement and 🗃️ Dataset labels on Oct 3, 2024
@aliberts self-assigned this on Oct 3, 2024
@Cadene self-requested a review on Oct 11, 2024
Review thread on lerobot/common/datasets/lerobot_dataset.py (outdated, resolved):
'~/.cache/huggingface/lerobot'.
episodes (list[int] | None, optional): If specified, this will only load episodes specified by
their episode_index in this list. Defaults to None.
split (str, optional): _description_. Defaults to "train".
Collaborator commented:
I thought we were removing split?

@aliberts (Collaborator, Author) replied on Oct 11, 2024:
I've removed it in 8bd406e (it wasn't used anymore). I suggest we just keep a notion of split in the info.json, as I've done in the conversion script:

"splits": {
    "train": "0:50"
}
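
If we keep that convention, a "start:end" range could be read as a Python-style slice over episode indices. A hypothetical helper (not code from this PR) to make that concrete:

# Hypothetical: interpret a "start:end" split range from info.json as a list
# of episode indices.
def episodes_from_split(split_range: str) -> list[int]:
    start, end = (int(x) for x in split_range.split(":"))
    return list(range(start, end))

splits = {"train": "0:50"}
print(episodes_from_split(splits["train"]))  # [0, 1, ..., 49]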
