A flexible, modular framework based on Stable Baselines3 for training reinforcement learning (RL) and imitation learning (IL) agents across diverse robotic simulation environments and demonstration datasets.
This framework provides a unified interface for training policies using either simulation environments or demonstration datasets. The architecture ensures complete independence between data sources (environments vs datasets) and policy implementations, allowing seamless switching between training paradigms while maintaining consistent data processing pipelines.
Researchers working on:
- Robot learning algorithms
- Multi-task policy learning
- Sim-to-real transfer
- Comparative studies across environments/datasets
Practitioners needing:
- Rapid prototyping of robot policies
- Flexible experimentation with different architectures
- Unified training pipeline across multiple simulators
- Environment-agnostic: Works with Isaac Lab, ManiSkill, MuJoCo Playground, Aloha and other custom Gym environments
- Dataset-agnostic: Compatible with LeRobot datasets
- Shared data format ensures policies work seamlessly across sources
- Supervised Learning: Behavior cloning from demonstrations
- On-policy RL: PPO, Recurrent PPO, Transformer PPO (with an efficient rollout buffer implementation)
- Off-policy RL: SAC, FastSAC
- Distributed training with Accelerate
- Mixed precision training (FP16/BF16)
- Gradient accumulation
- Comprehensive logging (TensorBoard, Weights & Biases)
- Automatic checkpointing and resumption
- Extensive test suite
The framework follows a layered architecture that separates concerns:
- **Data Sources**
  - Environments: Isaac Lab, ManiSkill, MuJoCo Playground, Aloha (via SB3 Wrapper), or contribute by implementing your own!
  - Datasets: LeRobot demonstrations (via DS Wrapper)
  - Both produce standardized observation/action dictionaries
- **Preprocessors**
  - Transform source-specific data formats to policy inputs
  - Handle normalization, image processing, sequence formatting
  - Examples: `Gym_2_Mlp`, `Gym_2_Lstm`, `Gym_2_Sac`, `Aloha_2_Lstm`
- **Agents**
  - Manage training loops (rollout collection, batch optimization)
  - Interface with policies through preprocessors
  - Handle loss computation and gradient updates
  - Examples: `PPO`, `RecurrentPPO`, `TransformerPPO`, `SAC`, `FastSAC`, `SL`
- **Policies**
  - Neural network architectures (actor-critic or standalone)
  - Independent of data source or training algorithm
  - Examples: `MlpPolicy`, `LSTMPolicy`, `TransformerPolicy`, `SACPolicy`, `FastSACPolicy`, `TCNPolicy`
- **Entry Points**
  - `train.py`: Online RL from simulation
  - `train_off.py`: Offline IL from demonstrations
  - `predict.py`: Policy evaluation and deployment
```
Environment/Dataset → Wrapper → Preprocessor → Policy → Agent → Optimization
                         ↓            ↓           ↓        ↓
                   Standardized   Normalized   Actions   Loss
                      Format        Inputs
```
- Python 3.11+
- CUDA 11.8+ (for GPU training)
```bash
# Clone repository
git clone https://github.com/johnMinelli/stable-baselines3-devkit
cd stable-baselines3-devkit/src

# Install dependencies
pip install -r requirements.txt
```

Train a PPO agent with an MLP policy on Isaac Lab:
```bash
python train.py \
    --task Isaac-Lift-Cube-Franka-v0 \
    --envsim isaaclab \
    --agent custom_ppo_mlp \
    --num_envs 4096 \
    --device cuda \
    --headless
```

Train a Recurrent PPO agent with an LSTM policy:
```bash
python train.py \
    --task Isaac-Velocity-Flat-Anymal-D-v0 \
    --envsim isaaclab \
    --agent custom_ppo_lstm \
    --num_envs 2048 \
    --device cuda \
    --headless
```

Offline IL Training (demonstrations required, e.g. ManiSkill_StackCube-v1)
Train an LSTM policy via behavior cloning:
```bash
python train_off.py \
    --task SL \
    --agent Lerobot/StackCube/lerobot_sl_lstm_cfg \
    --device cuda \
    --n_epochs 200 \
    --batch_size 64
```

Evaluate a trained policy:
```bash
python predict.py \
    --task Isaac-Velocity-Flat-Anymal-D-v0 \
    --envsim isaaclab \
    --agent custom_ppo_mlp \
    --num_envs 1 \
    --val_episodes 100 \
    --device cuda \
    --resume
    # optional: --checkpoint path/to/best_model.zip
```

Data flows from the data source to the policy: it is first converted to a shared format, then passed through the policy-specific processing.
The element that performs this conversion is the `env_2_policy`-specific processor. There you define `proc_observation_space` and `proc_action_space`, and in the `forward` / `forward_post` methods you perform the pre- and post-processing needed to match those processed spaces (i.e. what the policy expects).
The processor expects data in the shared data format, whether it comes from a dataset or an environment, structured similarly to:

```python
observation_space = spaces.Dict({
    "state": ...,
    "images": ...,
})
```
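For instance, a manipulation task with one camera could expose a shared space like the following (purely illustrative, using gymnasium spaces; the camera key name and all shapes depend on your setup):

```python
import math
import numpy as np
from gymnasium import spaces

# Illustrative shared format: a proprioceptive state vector plus one RGB camera.
observation_space = spaces.Dict({
    "state": spaces.Box(-math.inf, math.inf, shape=(32,), dtype=np.float32),
    "images": spaces.Dict({
        "front_cam": spaces.Box(0, 255, shape=(224, 224, 3), dtype=np.uint8),
    }),
})
```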
- For LeRobot datasets, we process the data to assume this shape in `common.datasets.dataloader:DataLoader`, then the processor does the last operations specifically for the policy. Note that LeRobot datasets should be consistent enough to be managed by the same dataloader.
- For environments, we process the data to assume this shape in `common.envs.sb3_env_wrapper:Sb3EnvStdWrapper`, then the processor does the last operations specifically for the policy.
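As a rough sketch of such a processor (not the actual implementation: the constructor signature and the exact base-class hooks are assumptions based on the description above, and the class name is hypothetical):

```python
from common.preprocessor import Preprocessor

class MyEnv_2_Mlp(Preprocessor):  # hypothetical name, following the env_2_policy convention
    def __init__(self, observation_space, action_space, **kwargs):
        # Constructor signature assumed; check the existing processors for the real one.
        super().__init__(observation_space, action_space, **kwargs)
        # Processed spaces: what the policy actually receives and produces.
        # Here an MLP policy consumes only the flat "state" vector, and the
        # action space is passed through unchanged.
        self.proc_observation_space = observation_space["state"]
        self.proc_action_space = action_space

    def forward(self, obs):
        # Pre-processing: shared {"state", "images"} dict -> policy input
        return obs["state"]

    def forward_post(self, actions):
        # Post-processing: policy output -> source action format (identity here)
        return actions
```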
The hard reality is that each data source differs, so you will likely need a wrapper class. For example, `ManiSkillEnvStdWrapper` prepares the data this way:
```python
self.single_observation_space = spaces.Dict({
    "policy": spaces.Box(-math.inf, math.inf, self.env.single_observation_space["state"].shape),
    **{cam_name: cam["rgb"] for cam_name, cam in self.env.single_observation_space["sensor_data"].items()}
})
```
because `Sb3EnvStdWrapper` (in this case wrapping `ManiSkillEnvStdWrapper(ManiSkillEnv)`) establishes a general observation space that must hold for all possible wrapped environments:
```python
observation_space = spaces.Dict({
    "state": self.env.single_observation_space["policy"],
    "images": spaces.Dict({k: v for k, v in self.env.single_observation_space.items() if k != "policy"})
})
```
TLDR: If your environment does not work well with how `Sb3EnvStdWrapper` groups the data, create a wrapper chain like `Sb3EnvStdWrapper(CustomEnvStdWrapper(CustomEnv))`; this way you maintain data-source interchangeability. Then create a `customenv_2_policy` processor to handle your specific needs.
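For example, a minimal custom wrapper in such a chain might only remap your environment's raw observation keys into the layout `Sb3EnvStdWrapper` expects (a hedged sketch: the raw keys "proprio" and "rgb" are placeholders for whatever your environment actually exposes, and the helper method is hypothetical):

```python
import math
import gymnasium as gym
from gymnasium import spaces

class CustomEnvStdWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        # Expose a "policy" entry for the proprioceptive state and one entry per
        # camera; Sb3EnvStdWrapper then regroups these into {"state", "images"}.
        self.single_observation_space = spaces.Dict({
            "policy": spaces.Box(-math.inf, math.inf, env.observation_space["proprio"].shape),
            "front_cam": env.observation_space["rgb"],
        })

    def _standardize(self, obs):
        # Remap the raw observation dict into the keys declared above;
        # apply this in your reset/step overrides.
        return {"policy": obs["proprio"], "front_cam": obs["rgb"]}
```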
Policies must inherit from BasePolicy and implement required methods:
```python
from stable_baselines3.common.policies import BasePolicy

class CustomPolicy(BasePolicy):
    def __init__(self, observation_space, action_space, lr_schedule, **kwargs):
        super().__init__(observation_space, action_space, ...)
        # Define networks

    def forward(self, obs):
        # Compute actions, values, log_probs
        return actions, values, log_probs

    def predict_values(self, obs):
        # Compute state values
        return values
```

Register it in the agent's `policy_aliases`:
```python
class PPO(OnPolicyAlgorithm):
    policy_aliases = {
        "MlpPolicy": MlpPolicy,
        "CustomPolicy": CustomPolicy,
    }
```
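Once registered, the alias can be passed as the policy name when constructing the agent, following the usual Stable Baselines3 convention (illustrative only; in this repo the policy is normally selected through the agent's YAML configuration consumed by `train.py`):

```python
# `env` stands for any environment already wrapped into the shared format.
model = PPO("CustomPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)
```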
Preprocessors bridge data sources to policies:

```python
from common.preprocessor import Preprocessor

class CustomPreprocessor(Preprocessor):
    def preprocess(self, obs):
        # Transform observations for policy
        processed = self.normalize_observations(obs)
        # Additional transformations
        return processed

    def postprocess(self, actions):
        # Transform policy outputs for environment
        return self.unnormalize_actions(actions)
```

- Create an environment-specific wrapper:
```python
import gymnasium as gym

class NewEnvWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        # Setup observation/action spaces

    def reset(self):
        # Return standardized observations
        ...

    def step(self, action):
        # Return standardized obs, reward, done, info
        ...
```

- Create a preprocessor for the environment:
```python
class NewEnv_2_Policy(Preprocessor):
    # Implement preprocessing logic
    ...
```

- Add a YAML configuration file in `configs/agents/NewEnv/`
- Update `train.py` imports:

```python
if args_cli.envsim == "newenv":
    import newenv_package
```

Yes, you can! :)
If you use this framework in your research, please cite:
```bibtex
@misc{stable-baselines3-devkit,
  title  = {Stable Baselines3 DevKit},
  author = {Giovanni Minelli},
  year   = {2026},
  url    = {https://github.com/johnMinelli/stable-baselines3-devkit}
}
```

This framework builds upon:
