A flexible, modular framework based on Stable Baselines3 for training reinforcement learning (RL) and imitation learning (IL) agents across diverse robotic simulation environments and demonstration datasets.
This framework provides a unified interface for training policies using either simulation environments or demonstration datasets. The architecture ensures complete independence between data sources (environments vs datasets) and policy implementations, allowing seamless switching between training paradigms while maintaining consistent data processing pipelines.
Researchers working on:
- Robot learning algorithms
- Multi-task policy learning
- Sim-to-real transfer
- Comparative studies across environments/datasets
Practitioners needing:
- Rapid prototyping of robot policies
- Flexible experimentation with different architectures
- Unified training pipeline across multiple simulators
- Environment-agnostic: Works with Isaac Lab, ManiSkill, MuJoCo Playground, Aloha and other custom Gym environments
- Dataset-agnostic: Compatible with LeRobot datasets
- Shared data format ensures policies work seamlessly across sources
- Supervised Learning: Behavior cloning from demonstrations
- On-policy RL: PPO, Recurrent PPO, Transformer PPO (with an efficient rollout buffer implementation)
- Off-policy RL: SAC, FastSAC
- Distributed training with Accelerate
- Mixed precision training (FP16/BF16)
- Gradient accumulation
- Comprehensive logging (TensorBoard, Weights & Biases)
- Automatic checkpointing and resumption
- Extensive test suite
The framework follows a layered architecture that separates concerns:
- **Data Sources**
  - Environments: Isaac Lab, ManiSkill, MuJoCo Playground, Aloha (via SB3 Wrapper), or contribute by implementing your own!
  - Datasets: LeRobot demonstrations (via DS Wrapper)
  - Both produce standardized observation/action dictionaries
- **Preprocessors**
  - Transform source-specific data formats to policy inputs
  - Handle normalization, image processing, sequence formatting
  - Examples: `Gym_2_Mlp`, `Gym_2_Lstm`, `Gym_2_Sac`, `Aloha_2_Lstm`
- **Agents**
  - Manage training loops (rollout collection, batch optimization)
  - Interface with policies through preprocessors
  - Handle loss computation and gradient updates
  - Examples: `PPO`, `RecurrentPPO`, `TransformerPPO`, `SAC`, `FastSAC`, `SL`
- **Policies**
  - Neural network architectures (actor-critic or standalone)
  - Independent of data source or training algorithm
  - Examples: `MlpPolicy`, `LSTMPolicy`, `TransformerPolicy`, `SACPolicy`, `FastSACPolicy`, `TCNPolicy`
- **Entry Points**
  - `train.py`: Online RL from simulation
  - `train_off.py`: Offline IL from demonstrations
  - `predict.py`: Policy evaluation and deployment
```
Environment/Dataset → Wrapper → Preprocessor → Policy → Agent → Optimization
                         ↓            ↓           ↓        ↓
                   Standardized   Normalized   Actions   Loss
                      Format        Inputs
```
- Python 3.11+
- CUDA 11.8+ (for GPU training)
```bash
# Clone repository
git clone https://github.com/johnMinelli/stable-baselines3-devkit
cd stable-baselines3-devkit/src

# Install dependencies
pip install -r requirements.txt
```

Train a PPO agent with an MLP policy on Isaac Lab:
```bash
python train.py \
    --task Isaac-Lift-Cube-Franka-v0 \
    --envsim isaaclab \
    --agent custom_ppo_mlp \
    --num_envs 4096 \
    --device cuda \
    --headless
```

Train a Recurrent PPO agent with an LSTM policy:
```bash
python train.py \
    --task Isaac-Velocity-Flat-Anymal-D-v0 \
    --envsim isaaclab \
    --agent custom_ppo_lstm \
    --num_envs 2048 \
    --device cuda \
    --headless
```

Offline IL Training (demonstrations required, e.g. ManiSkill_StackCube-v1)
Train an LSTM policy via behavior cloning:
```bash
python train_off.py \
    --task SL \
    --agent Lerobot/StackCube/lerobot_sl_lstm_cfg \
    --device cuda \
    --n_epochs 200 \
    --batch_size 64
```

Evaluate a trained policy:
```bash
python predict.py \
    --task Isaac-Velocity-Flat-Anymal-D-v0 \
    --envsim isaaclab \
    --agent custom_ppo_mlp \
    --num_envs 1 \
    --val_episodes 100 \
    --device cuda \
    --resume
    # optional: --checkpoint path/to/best_model.zip
```

Data flows from the data source to the policy: it is first converted to a shared format, then passed through the policy-specific processing.
The element that performs this conversion is the `env_2_policy`-specific processor. There you define `proc_observation_space` and `proc_action_space`, and in the `forward` / `forward_post` methods you perform the pre- and post-processing needed to match those processed spaces (i.e. what the policy expects).
The processor expects data in the shared data format, whether it comes from a dataset or an environment, structured similarly to:

```python
observation_space = spaces.Dict({
    "state": ...,
    "images": ...,
})
```
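For instance, a manipulation task with one camera could expose a shared space like the following (purely illustrative, using gymnasium spaces; the camera key name and all shapes depend on your setup):

```python
import math
import numpy as np
from gymnasium import spaces

# Illustrative shared format: a proprioceptive state vector plus one RGB camera.
observation_space = spaces.Dict({
    "state": spaces.Box(-math.inf, math.inf, shape=(32,), dtype=np.float32),
    "images": spaces.Dict({
        "front_cam": spaces.Box(0, 255, shape=(224, 224, 3), dtype=np.uint8),
    }),
})
```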
- For LeRobot datasets, we process the data to assume this shape in `common.datasets.dataloader:DataLoader`, then the processor does the last operations specifically for the policy. Note that LeRobot datasets should be consistent enough to be managed by the same dataloader.
- For environments, we process the data to assume this shape in `common.envs.sb3_env_wrapper:Sb3EnvStdWrapper`, then the processor does the last operations specifically for the policy.
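As a rough sketch of such a processor (not the actual implementation: the constructor signature and the exact base-class hooks are assumptions based on the description above, and the class name is hypothetical):

```python
from common.preprocessor import Preprocessor

class MyEnv_2_Mlp(Preprocessor):  # hypothetical name, following the env_2_policy convention
    def __init__(self, observation_space, action_space, **kwargs):
        # Constructor signature assumed; check the existing processors for the real one.
        super().__init__(observation_space, action_space, **kwargs)
        # Processed spaces: what the policy actually receives and produces.
        # Here an MLP policy consumes only the flat "state" vector, and the
        # action space is passed through unchanged.
        self.proc_observation_space = observation_space["state"]
        self.proc_action_space = action_space

    def forward(self, obs):
        # Pre-processing: shared {"state", "images"} dict -> policy input
        return obs["state"]

    def forward_post(self, actions):
        # Post-processing: policy output -> source action format (identity here)
        return actions
```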
The hard reality is that each data source differs, so you will likely need a wrapper class. For example, `ManiSkillEnvStdWrapper` prepares the data this way:
```python
self.single_observation_space = spaces.Dict({
    "policy": spaces.Box(-math.inf, math.inf, self.env.single_observation_space["state"].shape),
    **{cam_name: cam["rgb"] for cam_name, cam in self.env.single_observation_space["sensor_data"].items()}
})
```
because `Sb3EnvStdWrapper` (in this case wrapping `ManiSkillEnvStdWrapper(ManiSkillEnv)`) establishes a general observation space that must hold for all possible wrapped environments:
```python
observation_space = spaces.Dict({
    "state": self.env.single_observation_space["policy"],
    "images": spaces.Dict({k: v for k, v in self.env.single_observation_space.items() if k != "policy"})
})
```
TLDR: If your environment does not work well with how `Sb3EnvStdWrapper` groups the data, create a wrapper chain like `Sb3EnvStdWrapper(CustomEnvStdWrapper(CustomEnv))`; this way you maintain data-source interchangeability. Then create a `customenv_2_policy` processor to handle your specific needs.
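For example, a minimal custom wrapper in such a chain might only remap your environment's raw observation keys into the layout `Sb3EnvStdWrapper` expects (a hedged sketch: the raw keys "proprio" and "rgb" are placeholders for whatever your environment actually exposes, and the helper method is hypothetical):

```python
import math
import gymnasium as gym
from gymnasium import spaces

class CustomEnvStdWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        # Expose a "policy" entry for the proprioceptive state and one entry per
        # camera; Sb3EnvStdWrapper then regroups these into {"state", "images"}.
        self.single_observation_space = spaces.Dict({
            "policy": spaces.Box(-math.inf, math.inf, env.observation_space["proprio"].shape),
            "front_cam": env.observation_space["rgb"],
        })

    def _standardize(self, obs):
        # Remap the raw observation dict into the keys declared above;
        # apply this in your reset/step overrides.
        return {"policy": obs["proprio"], "front_cam": obs["rgb"]}
```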
Policies must inherit from BasePolicy and implement required methods:
```python
from stable_baselines3.common.policies import BasePolicy

class CustomPolicy(BasePolicy):
    def __init__(self, observation_space, action_space, lr_schedule, **kwargs):
        super().__init__(observation_space, action_space, ...)
        # Define networks

    def forward(self, obs):
        # Compute actions, values, log_probs
        return actions, values, log_probs

    def predict_values(self, obs):
        # Compute state values
        return values
```

Register it in the agent's `policy_aliases`:
```python
class PPO(OnPolicyAlgorithm):
    policy_aliases = {
        "MlpPolicy": MlpPolicy,
        "CustomPolicy": CustomPolicy,
    }
```
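Once registered, the alias can be passed as the policy name when constructing the agent, following the usual Stable Baselines3 convention (illustrative only; in this repo the policy is normally selected through the agent's YAML configuration consumed by `train.py`):

```python
# `env` stands for any environment already wrapped into the shared format.
model = PPO("CustomPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)
```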
Preprocessors bridge data sources to policies:

```python
from common.preprocessor import Preprocessor

class CustomPreprocessor(Preprocessor):
    def preprocess(self, obs):
        # Transform observations for policy
        processed = self.normalize_observations(obs)
        # Additional transformations
        return processed

    def postprocess(self, actions):
        # Transform policy outputs for environment
        return self.unnormalize_actions(actions)
```

- Create an environment-specific wrapper:
```python
import gymnasium as gym

class NewEnvWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        # Setup observation/action spaces

    def reset(self):
        # Return standardized observations
        ...

    def step(self, action):
        # Return standardized obs, reward, done, info
        ...
```

- Create a preprocessor for the environment:
```python
class NewEnv_2_Policy(Preprocessor):
    # Implement preprocessing logic
    ...
```

- Add a YAML configuration file in `configs/agents/NewEnv/`
- Update `train.py` imports:

```python
if args_cli.envsim == "newenv":
    import newenv_package
```

Yes, you can! :)
If you use this framework in your research, please cite:
```bibtex
@misc{stable-baselines3-devkit,
  title  = {Stable Baselines3 DevKit},
  author = {Giovanni Minelli},
  year   = {2026},
  url    = {https://github.com/johnMinelli/stable-baselines3-devkit}
}
```

This framework builds upon:
