- [2026/2/26] 🔥 We release OpenClaw-RL v1 – a fully asynchronous RL framework for training personalized AI agents from natural conversation feedback.
OpenClaw-RL is a fully asynchronous reinforcement learning framework that turns everyday conversations into training signals for personalized AI agents.
Most RL-for-LLM systems assume centralized, batch-mode training with pre-collected datasets. OpenClaw-RL takes a fundamentally different approach: it wraps your self-hosted model in OpenClaw as an OpenAI-compatible API, intercepts live multi-turn conversations, and continuously optimizes the policy in the background, all without interrupting your usage.
OpenClaw-RL decouples agent serving, rollout collection, PRM judging, and policy training into independent async loops. None of them blocks the others: the model serves requests while training runs in the background, and PRM evaluation happens concurrently with new conversations.
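The decoupled-loop idea can be sketched with `asyncio` queues. This is a minimal illustration, not the actual OpenClaw-RL internals; all names and the stand-in PRM/trainer logic are invented for the sketch:

```python
import asyncio

# Sketch: serving, PRM judging, and training run as independent coroutines
# that hand work off through queues, so none of them blocks the others.
# (Illustrative only -- the real loops talk to SGLang / the PRM / Slime.)

async def serve_loop(rollout_q, n_turns):
    for t in range(n_turns):
        await asyncio.sleep(0)          # stay responsive between requests
        await rollout_q.put({"turn": t, "text": f"response {t}"})
    await rollout_q.put(None)           # sentinel: no more rollouts

async def judge_loop(rollout_q, train_q):
    while (sample := await rollout_q.get()) is not None:
        # Stand-in for PRM scoring of the turn from next-state feedback.
        sample["reward"] = 1.0 if sample["turn"] % 2 == 0 else -1.0
        await train_q.put(sample)
    await train_q.put(None)

async def train_loop(train_q, trained):
    while (sample := await train_q.get()) is not None:
        trained.append(sample)          # a real trainer updates weights here

async def main(n_turns=4):
    rollout_q, train_q, trained = asyncio.Queue(), asyncio.Queue(), []
    await asyncio.gather(
        serve_loop(rollout_q, n_turns),
        judge_loop(rollout_q, train_q),
        train_loop(train_q, trained),
    )
    return trained

trained = asyncio.run(main())
```

Because each stage only blocks on its own queue, serving can keep producing rollouts while judging and training drain them at their own pace.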
The entire stack (model, PRM, training) runs on your own infrastructure. Conversation data never leaves your system. No external API keys required.
You don't need to manually label data. The system automatically:
- Classifies API messages into main-line (trainable) vs. side (non-trainable) turns
- Uses the next user/environment message as a natural "next state" signal
- Runs PRM evaluation asynchronously with majority voting for robust scoring
- Submits ready samples to the trainer as they become available
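The labeling flow above can be sketched in a few lines. Function and field names here are hypothetical, chosen only to mirror the description (main-line vs. side classification, next-state signal, majority voting):

```python
from collections import Counter

# Hypothetical sketch of the automatic labeling flow (not the real
# OpenClaw-RL code): a turn is trainable ("main-line") only if a later
# user/environment message exists to serve as its natural next state.

def classify_turn(msg):
    return "main" if msg.get("next_state") is not None else "side"

def majority_vote(votes):
    # The PRM is queried several times; the modal label wins.
    label, _ = Counter(votes).most_common(1)[0]
    return label

turns = [
    {"role": "assistant", "text": "ran the tests", "next_state": "looks good"},
    {"role": "assistant", "text": "draft reply", "next_state": None},
]
trainable = [t for t in turns if classify_turn(t) == "main"]
reward_label = majority_vote(["good", "good", "bad"])
```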
Binary RL (GRPO): A Process Reward Model scores each turn as good/bad/neutral based on the next-state feedback. The scalar reward is used with GRPO advantage estimation and a PPO-style clipped surrogate loss.
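As a worked sketch of the two ingredients named above: GRPO normalizes scalar rewards within a group to get advantages, and the PPO-style objective clips the policy ratio. Plain Python floats are used for clarity; the real trainer works on token log-probs inside Slime:

```python
import math

# GRPO-style advantages: normalize rewards within a group of rollouts.
def grpo_advantages(rewards, eps=1e-6):
    mu = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / (std + eps) for r in rewards]

# PPO-style clipped surrogate for one token/action.
def clipped_surrogate(logp_new, logp_old, adv, clip=0.2):
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1 - clip), 1 + clip)   # clamp ratio to [1-clip, 1+clip]
    return min(ratio * adv, clipped * adv)          # pessimistic (clipped) objective

advs = grpo_advantages([1.0, -1.0, 1.0, 0.0])       # PRM labels mapped to +1/-1/0
loss_term = -clipped_surrogate(-0.9, -1.0, advs[0]) # minimize the negative surrogate
```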
On-Policy Distillation (OPD): When the next state reveals useful hindsight, a judge model extracts a textual hint. This hint augments the original prompt to create an "enhanced teacher," whose token-level log-probability gap with the student becomes a directional advantage signal, richer than any scalar reward.
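The directional signal reduces to a per-token log-probability difference. In this sketch the probabilities are made up; the real system reads them from the hint-enhanced teacher and the student model:

```python
import math

# OPD signal sketch: the per-token log-prob gap between the "enhanced
# teacher" (same model, prompt augmented with a hindsight hint) and the
# student acts as a directional, token-level advantage.

def opd_advantages(teacher_logps, student_logps):
    # Positive where the hint makes the teacher prefer the token more
    # strongly than the student does; negative where it prefers it less.
    return [t - s for t, s in zip(teacher_logps, student_logps)]

student = [math.log(p) for p in (0.5, 0.2, 0.9)]
teacher = [math.log(p) for p in (0.7, 0.1, 0.9)]
advs = opd_advantages(teacher, student)
```

Unlike a single scalar reward per turn, this tells the student *which* tokens to push up and which to push down.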
- Session-aware training: Multi-turn conversations are tracked per-session with proper turn ordering
- Graceful weight updates: Submission pauses during model updates, then resumes, with no data corruption
- At-least-one guarantee (Binary RL): Every session contributes at least one effective training sample
- Hint quality filtering (OPD): Only the longest, most informative hint among m votes is selected; trivial hints are discarded
- Teacher log-prob optimization (OPD): Only response-suffix log-probs are computed to reduce peak memory
- Record & debug: All conversations and PRM evaluations are logged to JSONL for analysis
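The record-and-debug format is append-only JSONL: one JSON object per line per event. The field names below are illustrative, not the actual schema:

```python
import io
import json

# Sketch of JSONL logging (one JSON object per line); field names are
# hypothetical, not the real OpenClaw-RL record schema.
def log_jsonl(fp, record):
    fp.write(json.dumps(record, ensure_ascii=False) + "\n")

buf = io.StringIO()  # stands in for an open log file
log_jsonl(buf, {"session": "s1", "turn": 0, "reward": 1.0,
                "votes": ["good", "good", "bad"]})
log_jsonl(buf, {"session": "s1", "turn": 1, "reward": -1.0,
                "votes": ["bad"]})
lines = buf.getvalue().splitlines()
```

Each line parses independently, so logs can be tailed and analyzed while the system is still running.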
Our long-term goal is to advance personalized, practically useful agents with reinforcement learning. The roadmap has two tracks:
- ✅ Release v1: Fully async OpenClaw-RL framework with Binary RL + OPD
- ⬜ Broader model family support & more efficient serving
- ⬜ Best recipe discovery via large-scale experiments
- ⬜ Beyond the policy: extend learning to skills and memory
- ⬜ Next (2–3 weeks): Scalable agentic RL infra for general agents (computer-use first)
- Hardware: 8× GPUs (default; configurable via NUM_GPUS, ACTOR_GPUS, ROLLOUT_GPUS, PRM_GPUS)
- Software: CUDA 12.9, Python 3.12
- Framework: Slime (our base RL framework)
For detailed environment setup, see Slime or ./instructions/README.md.
We provide two methods (RL servers):
| Method | Signal Type | How It Works | When to Use |
|---|---|---|---|
| Binary RL | Scalar (+1/−1/0) | PRM judges response quality from next-state feedback via majority vote → GRPO | Abundant implicit feedback (likes, env success/failure) |
| On-Policy Distillation (OPD) | Token-level directional | Extract hindsight hints from next-state → construct enhanced teacher → token-level distillation | Rich textual feedback; need directional improvement |
Choose your optimization method:
Option A: Binary RL – Best for implicit feedback (likes/dislikes, env success/failure)
```bash
cd slime
bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh
```

The PRM will automatically judge response quality from next-state feedback. We recommend providing frequent feedback (e.g., 👍/👎) to help the model optimize effectively.
See ./openclaw-rl/README.md for algorithm details.
Option B: On-Policy Distillation (OPD) – Best for rich textual feedback
```bash
cd slime
bash ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh
```

The system extracts hindsight hints from your feedback and distills them into the policy at the token level. We recommend providing concrete feedback (e.g., "you should have checked the file first" or "don't use that library").
See ./openclaw-opd/README.md for algorithm details.
Once running, the model is served as an OpenAI-compatible API at:
http://<HOST_IP>:30000/v1
where <HOST_IP> is the IP address of the machine running the RL server (e.g. 115.190.98.251). The port 30000 is the default and can be changed via the PORT environment variable.
Take note of this endpoint β you will need it when configuring OpenClaw in the next step.
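You can sanity-check the endpoint before wiring it into OpenClaw. The sketch below builds a standard OpenAI-style chat-completions request with only the standard library; the host, port, model id, and API key are placeholders, and `/v1/chat/completions` is assumed to be the usual OpenAI-compatible route exposed by the server:

```python
import json
import urllib.request

# Hypothetical smoke test for the OpenAI-compatible endpoint.
# Host/port/model/api_key are placeholders from the config in this README.
def chat_request(host, prompt, model="qwen3-4b", api_key="apiKey", port=30000):
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

req = chat_request("127.0.0.1", "ping")
# Once the server is up: urllib.request.urlopen(req) returns the completion.
```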
Install OpenClaw from the version bundled in this repository (we will update it regularly):
Then configure OpenClaw to route requests to your RL server. Open your openclaw.json (or the equivalent settings file) and add a provider entry under "models" β "providers":
```json
{
  "models": {
    "providers": {
      "qwen": {
        "baseUrl": "http://<HOST_IP>:30000/v1",
        "apiKey": "apiKey",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3-4b",
            "name": "Qwen3 4B",
            "reasoning": true,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 32768,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}
```

Replace <HOST_IP> with the IP address of your RL server machine. The apiKey should match the SGLANG_API_KEY you set when starting the server.
That's it β start chatting with your OpenClaw agent. The RL server will automatically collect conversation trajectories, compute rewards, and train the model. Your agent gets better the more you use it.
Before launching, set these important environment variables as needed:
| Variable | Default | Description |
|---|---|---|
| NUM_GPUS | 8 | Total GPUs available on the machine |
| ACTOR_GPUS | 4 | GPUs allocated to the training actor |
| ROLLOUT_GPUS | 2 | GPUs allocated to rollout generation |
| PRM_GPUS | 2 | GPUs allocated to the Process Reward Model |
| HF_CKPT | (see script) | Path to the base HuggingFace checkpoint |
| PRM_MODEL_PATH | (see script) | Path to the reward model HuggingFace checkpoint |
| SAVE_CKPT | (see script) | Path to the saved HuggingFace checkpoint |
| SGLANG_API_KEY | – | API key for the SGLang serving endpoint |
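The defaults split the 8 GPUs as 4 + 2 + 2 across actor, rollout, and PRM. A small sanity-check sketch; the constraint that the three pools sum to NUM_GPUS is our inference from these defaults, not documented behavior:

```python
import os

# Read the GPU-split variables with the README's defaults as fallbacks.
# Assumption (from the defaults, not the docs): the three pools partition
# the total GPU count.
cfg = {k: int(os.environ.get(k, d)) for k, d in
       [("NUM_GPUS", 8), ("ACTOR_GPUS", 4), ("ROLLOUT_GPUS", 2), ("PRM_GPUS", 2)]}
assert cfg["ACTOR_GPUS"] + cfg["ROLLOUT_GPUS"] + cfg["PRM_GPUS"] == cfg["NUM_GPUS"]
```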
More configuration details are available in ./instructions.
@misc{wang2026openclawrl,
author = {Wang, Yinjie and Wang, Mengdi and Yang, Ling},
title = {OpenClaw-RL},
year = {2026},
organization = {GitHub},
url = {https://github.com/Gen-Verse/OpenClaw-RL},
}
@article{yu2025demystify,
title={Demystifying Reinforcement Learning in Agentic Reasoning},
author={Yu, Zhaochen and Yang, Ling and Zou, Jiaru and Yan, Shuicheng and Wang, Mengdi},
journal={arXiv preprint arXiv:2510.11701},
year={2025}
}
@article{wang2026rlanything,
title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
journal={arXiv preprint arXiv:2602.02488},
year={2026}
}
This work aims to explore more effective paradigms for Agentic RL. Our implementation builds upon the excellent codebases of slime, OpenClaw and Open-AgentRL. We sincerely thank these projects for their valuable insights and high-quality implementations, which have greatly facilitated our research.
