- [2026/2/26] 🔥 We release OpenClaw-RL v1 – a fully asynchronous RL framework for training personalized AI agents from natural conversation feedback.
OpenClaw-RL is a fully asynchronous reinforcement learning framework that turns everyday conversations into training signals for personalized AI agents.
Most RL-for-LLM systems assume centralized, batch-mode training with pre-collected datasets. OpenClaw-RL takes a fundamentally different approach: it wraps your self-hosted model in OpenClaw as an OpenAI-compatible API, intercepts live multi-turn conversations, and continuously optimizes the policy in the background, all without interrupting your usage.
OpenClaw-RL decouples agent serving, rollout collection, PRM judging, and policy training into independent async loops. None of them blocks the others: the model serves requests while training runs in the background, and PRM evaluation happens concurrently with new conversations.
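The decoupled-loop idea can be sketched with `asyncio` queues. This is a minimal illustration, not the actual OpenClaw-RL internals; all names and the stand-in PRM/trainer logic are invented for the sketch:

```python
import asyncio

# Sketch: serving, PRM judging, and training run as independent coroutines
# that hand work off through queues, so none of them blocks the others.
# (Illustrative only -- the real loops talk to SGLang / the PRM / Slime.)

async def serve_loop(rollout_q, n_turns):
    for t in range(n_turns):
        await asyncio.sleep(0)          # stay responsive between requests
        await rollout_q.put({"turn": t, "text": f"response {t}"})
    await rollout_q.put(None)           # sentinel: no more rollouts

async def judge_loop(rollout_q, train_q):
    while (sample := await rollout_q.get()) is not None:
        # Stand-in for PRM scoring of the turn from next-state feedback.
        sample["reward"] = 1.0 if sample["turn"] % 2 == 0 else -1.0
        await train_q.put(sample)
    await train_q.put(None)

async def train_loop(train_q, trained):
    while (sample := await train_q.get()) is not None:
        trained.append(sample)          # a real trainer updates weights here

async def main(n_turns=4):
    rollout_q, train_q, trained = asyncio.Queue(), asyncio.Queue(), []
    await asyncio.gather(
        serve_loop(rollout_q, n_turns),
        judge_loop(rollout_q, train_q),
        train_loop(train_q, trained),
    )
    return trained

trained = asyncio.run(main())
```

Because each stage only blocks on its own queue, serving can keep producing rollouts while judging and training drain them at their own pace.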
The entire stack (model, PRM, training) runs on your own infrastructure. Conversation data never leaves your system. No external API keys required.
You don't need to manually label data. The system automatically:
- Classifies API messages into main-line (trainable) vs. side (non-trainable) turns
- Uses the next user/environment message as a natural "next state" signal
- Runs PRM evaluation asynchronously with majority voting for robust scoring
- Submits ready samples to the trainer as they become available
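The labeling flow above can be sketched in a few lines. Function and field names here are hypothetical, chosen only to mirror the description (main-line vs. side classification, next-state signal, majority voting):

```python
from collections import Counter

# Hypothetical sketch of the automatic labeling flow (not the real
# OpenClaw-RL code): a turn is trainable ("main-line") only if a later
# user/environment message exists to serve as its natural next state.

def classify_turn(msg):
    return "main" if msg.get("next_state") is not None else "side"

def majority_vote(votes):
    # The PRM is queried several times; the modal label wins.
    label, _ = Counter(votes).most_common(1)[0]
    return label

turns = [
    {"role": "assistant", "text": "ran the tests", "next_state": "looks good"},
    {"role": "assistant", "text": "draft reply", "next_state": None},
]
trainable = [t for t in turns if classify_turn(t) == "main"]
reward_label = majority_vote(["good", "good", "bad"])
```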
Binary RL (GRPO): A Process Reward Model scores each turn as good/bad/neutral based on the next-state feedback. The scalar reward is used with GRPO advantage estimation and a PPO-style clipped surrogate loss.
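As a worked sketch of the two ingredients named above: GRPO normalizes scalar rewards within a group to get advantages, and the PPO-style objective clips the policy ratio. Plain Python floats are used for clarity; the real trainer works on token log-probs inside Slime:

```python
import math

# GRPO-style advantages: normalize rewards within a group of rollouts.
def grpo_advantages(rewards, eps=1e-6):
    mu = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / (std + eps) for r in rewards]

# PPO-style clipped surrogate for one token/action.
def clipped_surrogate(logp_new, logp_old, adv, clip=0.2):
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1 - clip), 1 + clip)   # clamp ratio to [1-clip, 1+clip]
    return min(ratio * adv, clipped * adv)          # pessimistic (clipped) objective

advs = grpo_advantages([1.0, -1.0, 1.0, 0.0])       # PRM labels mapped to +1/-1/0
loss_term = -clipped_surrogate(-0.9, -1.0, advs[0]) # minimize the negative surrogate
```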
On-Policy Distillation (OPD): When the next state reveals useful hindsight, a judge model extracts a textual hint. This hint augments the original prompt to create an "enhanced teacher," whose token-level log-probability gap with the student becomes a directional advantage signal, richer than any scalar reward.
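The directional signal reduces to a per-token log-probability difference. In this sketch the probabilities are made up; the real system reads them from the hint-enhanced teacher and the student model:

```python
import math

# OPD signal sketch: the per-token log-prob gap between the "enhanced
# teacher" (same model, prompt augmented with a hindsight hint) and the
# student acts as a directional, token-level advantage.

def opd_advantages(teacher_logps, student_logps):
    # Positive where the hint makes the teacher prefer the token more
    # strongly than the student does; negative where it prefers it less.
    return [t - s for t, s in zip(teacher_logps, student_logps)]

student = [math.log(p) for p in (0.5, 0.2, 0.9)]
teacher = [math.log(p) for p in (0.7, 0.1, 0.9)]
advs = opd_advantages(teacher, student)
```

Unlike a single scalar reward per turn, this tells the student *which* tokens to push up and which to push down.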
- Session-aware training: Multi-turn conversations are tracked per-session with proper turn ordering
- Graceful weight updates: Submission pauses during model updates, then resumes, with no data corruption
- At-least-one guarantee (Binary RL): Every session contributes at least one effective training sample
- Hint quality filtering (OPD): Only the longest, most informative hint among m votes is selected; trivial hints are discarded
- Teacher log-prob optimization (OPD): Only response-suffix log-probs are computed to reduce peak memory
- Record & debug: All conversations and PRM evaluations are logged to JSONL for analysis
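The record-and-debug format is append-only JSONL: one JSON object per line per event. The field names below are illustrative, not the actual schema:

```python
import io
import json

# Sketch of JSONL logging (one JSON object per line); field names are
# hypothetical, not the real OpenClaw-RL record schema.
def log_jsonl(fp, record):
    fp.write(json.dumps(record, ensure_ascii=False) + "\n")

buf = io.StringIO()  # stands in for an open log file
log_jsonl(buf, {"session": "s1", "turn": 0, "reward": 1.0,
                "votes": ["good", "good", "bad"]})
log_jsonl(buf, {"session": "s1", "turn": 1, "reward": -1.0,
                "votes": ["bad"]})
lines = buf.getvalue().splitlines()
```

Each line parses independently, so logs can be tailed and analyzed while the system is still running.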
Our long-term goal is to advance personalized, practically useful agents with reinforcement learning. The roadmap has two tracks:
- ✅ Release v1: Fully async OpenClaw-RL framework with Binary RL + OPD
- ⬜ Broader model family support & more efficient serving
- ⬜ Best recipe discovery via large-scale experiments
- ⬜ Beyond the policy: extend learning to skills and memory
- ⬜ Next (2–3 weeks): Scalable agentic RL infra for general agents (computer-use first)
- Hardware: 8× GPUs (default; configurable via NUM_GPUS, ACTOR_GPUS, ROLLOUT_GPUS, PRM_GPUS)
- Software: CUDA 12.9, Python 3.12
- Framework: Slime (our base RL framework)
For detailed environment setup, see Slime or ./instructions/README.md.
We provide two methods (RL servers):
| Method | Signal Type | How It Works | When to Use |
|---|---|---|---|
| Binary RL | Scalar (+1/−1/0) | PRM judges response quality from next-state feedback via majority vote → GRPO | Abundant implicit feedback (likes, env success/failure) |
| On-Policy Distillation (OPD) | Token-level directional | Extract hindsight hints from next-state → construct enhanced teacher → token-level distillation | Rich textual feedback; need directional improvement |
Choose your optimization method:
Option A: Binary RL – Best for implicit feedback (likes/dislikes, env success/failure)
```bash
cd slime
bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh
```

The PRM will automatically judge response quality from next-state feedback. We recommend providing frequent feedback (e.g., 👍/👎) to help the model optimize effectively.
See ./openclaw-rl/README.md for algorithm details.
Option B: On-Policy Distillation (OPD) – Best for rich textual feedback
```bash
cd slime
bash ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh
```

The system extracts hindsight hints from your feedback and distills them into the policy at the token level. We recommend providing concrete feedback (e.g., "you should have checked the file first" or "don't use that library").
See ./openclaw-opd/README.md for algorithm details.
Once running, the model is served as an OpenAI-compatible API at:
http://<HOST_IP>:30000/v1
where <HOST_IP> is the IP address of the machine running the RL server (e.g. 115.190.98.251). The port 30000 is the default and can be changed via the PORT environment variable.
Take note of this endpoint β you will need it when configuring OpenClaw in the next step.
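You can sanity-check the endpoint before wiring it into OpenClaw. The sketch below builds a standard OpenAI-style chat-completions request with only the standard library; the host, port, model id, and API key are placeholders, and `/v1/chat/completions` is assumed to be the usual OpenAI-compatible route exposed by the server:

```python
import json
import urllib.request

# Hypothetical smoke test for the OpenAI-compatible endpoint.
# Host/port/model/api_key are placeholders from the config in this README.
def chat_request(host, prompt, model="qwen3-4b", api_key="apiKey", port=30000):
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

req = chat_request("127.0.0.1", "ping")
# Once the server is up: urllib.request.urlopen(req) returns the completion.
```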
Install OpenClaw from the version bundled in this repository (we will update it regularly):
Then configure OpenClaw to route requests to your RL server. Open your openclaw.json (or the equivalent settings file) and add a provider entry under "models" β "providers":
```json
{
  "models": {
    "providers": {
      "qwen": {
        "baseUrl": "http://<HOST_IP>:30000/v1",
        "apiKey": "apiKey",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3-4b",
            "name": "Qwen3 4B",
            "reasoning": true,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 32768,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}
```

Replace <HOST_IP> with the IP address of your RL server machine. The apiKey should match the SGLANG_API_KEY you set when starting the server.
That's it β start chatting with your OpenClaw agent. The RL server will automatically collect conversation trajectories, compute rewards, and train the model. Your agent gets better the more you use it.
Before launching, set these important environment variables as needed:
| Variable | Default | Description |
|---|---|---|
| NUM_GPUS | 8 | Total GPUs available on the machine |
| ACTOR_GPUS | 4 | GPUs allocated to the training actor |
| ROLLOUT_GPUS | 2 | GPUs allocated to rollout generation |
| PRM_GPUS | 2 | GPUs allocated to the Process Reward Model |
| HF_CKPT | (see script) | Path to the base HuggingFace checkpoint |
| PRM_MODEL_PATH | (see script) | Path to the reward model HuggingFace checkpoint |
| SAVE_CKPT | (see script) | Path to the saved HuggingFace checkpoint |
| SGLANG_API_KEY | – | API key for the SGLang serving endpoint |
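The defaults split the 8 GPUs as 4 + 2 + 2 across actor, rollout, and PRM. A small sanity-check sketch; the constraint that the three pools sum to NUM_GPUS is our inference from these defaults, not documented behavior:

```python
import os

# Read the GPU-split variables with the README's defaults as fallbacks.
# Assumption (from the defaults, not the docs): the three pools partition
# the total GPU count.
cfg = {k: int(os.environ.get(k, d)) for k, d in
       [("NUM_GPUS", 8), ("ACTOR_GPUS", 4), ("ROLLOUT_GPUS", 2), ("PRM_GPUS", 2)]}
assert cfg["ACTOR_GPUS"] + cfg["ROLLOUT_GPUS"] + cfg["PRM_GPUS"] == cfg["NUM_GPUS"]
```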
More configuration details are available in ./instructions.
@misc{wang2026openclawrl,
author = {Wang, Yinjie and Wang, Mengdi and Yang, Ling},
title = {OpenClaw-RL},
year = {2026},
organization = {GitHub},
url = {https://github.com/Gen-Verse/OpenClaw-RL},
}
@article{yu2025demystify,
title={Demystifying Reinforcement Learning in Agentic Reasoning},
author={Yu, Zhaochen and Yang, Ling and Zou, Jiaru and Yan, Shuicheng and Wang, Mengdi},
journal={arXiv preprint arXiv:2510.11701},
year={2025}
}
@article{wang2026rlanything,
title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
journal={arXiv preprint arXiv:2602.02488},
year={2026}
}
This work aims to explore more effective paradigms for Agentic RL. Our implementation builds upon the excellent codebases of slime, OpenClaw and Open-AgentRL. We sincerely thank these projects for their valuable insights and high-quality implementations, which have greatly facilitated our research.
