OpenClaw-RL: Personalize openclaw simply by talking to it

OpenClaw-RL

Empowering OpenClaw with RL — Train a personalized agent simply by talking to it.

Fully Async · Zero API Keys · Personalized · Auto Language Feedback

OpenClaw-RL Blog · OpenClaw Plugin · Slime Based · License MIT

(Demo video: demo.mp4)

📰 News

  • [2026/2/26] 🔥 We release OpenClaw-RL v1 — a fully asynchronous RL framework for training personalized AI agents from natural conversation feedback.

💡 TL;DR

OpenClaw-RL is a fully asynchronous reinforcement learning framework that turns everyday conversations into training signals for personalized AI agents.

Most RL-for-LLM systems assume centralized, batch-mode training with pre-collected datasets. OpenClaw-RL takes a fundamentally different approach: it wraps your self-hosted model in OpenClaw as an OpenAI-compatible API, intercepts live multi-turn conversations, and continuously optimizes the policy in the background — all without interrupting your usage.

Overview

🌈 Features

Fully Asynchronous 4-Component Architecture

OpenClaw-RL decouples agent serving, rollout collection, PRM judging, and policy training into independent async loops. None of them block one another — the model serves requests while training runs in the background, and PRM evaluation happens concurrently with new conversations.
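
The decoupling described above can be illustrated with plain asyncio queues. This is a toy sketch under assumed component names, not the actual OpenClaw-RL code: each loop only touches its own queues, so a slow judge or trainer never stalls serving.

```python
import asyncio

async def serve(rollouts: asyncio.Queue):
    """Stand-in for the serving loop: emits finished conversation turns."""
    for turn in range(3):
        await rollouts.put(f"turn-{turn}")
        await asyncio.sleep(0)  # yield control, as a real server would

async def collect(rollouts: asyncio.Queue, to_judge: asyncio.Queue):
    """Rollout collection loop: forwards turns for PRM scoring."""
    while True:
        sample = await rollouts.get()
        await to_judge.put(sample)

async def judge(to_judge: asyncio.Queue, to_train: asyncio.Queue):
    """PRM judging loop (here it just pretends every turn is 'good')."""
    while True:
        sample = await to_judge.get()
        await to_train.put((sample, +1))

async def train(to_train: asyncio.Queue, done: list):
    """Training loop: consumes scored samples; a gradient step would go here."""
    while len(done) < 3:
        done.append(await to_train.get())

async def main():
    rollouts, to_judge, to_train = (asyncio.Queue() for _ in range(3))
    done: list = []
    tasks = [asyncio.create_task(c) for c in
             (serve(rollouts), collect(rollouts, to_judge),
              judge(to_judge, to_train))]
    await train(to_train, done)
    for t in tasks:
        t.cancel()
    return done

results = asyncio.run(main())
print(results)
```

No loop ever blocks on another loop's work directly; back-pressure flows only through the queues, which is the property that lets serving continue while training runs.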

Self-Hosted & Private by Design

The entire stack (model, PRM, training) runs on your own infrastructure. Conversation data never leaves your system. No external API keys required.

From Conversation to Gradient — Automatically

You don't need to manually label data. The system automatically:

  • Classifies API messages into main-line (trainable) vs. side (non-trainable) turns
  • Uses the next user/environment message as a natural "next state" signal
  • Runs PRM evaluation asynchronously with majority voting for robust scoring
  • Submits ready samples to the trainer as they become available
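
The first two steps above can be sketched as follows. The function and field names are illustrative assumptions, not OpenClaw-RL's actual schema: each trainable assistant turn is paired with the next user/environment message, which serves as its "next state" signal.

```python
def pair_trainable_turns(messages):
    """messages: list of {"role": ..., "content": ...} in conversation order.

    Returns (response, next_state) pairs for assistant turns. The real
    system would additionally filter out side (non-trainable) turns here.
    """
    samples = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            continue  # only assistant turns are candidates for training
        nxt = next((m for m in messages[i + 1:]
                    if m["role"] in ("user", "environment")), None)
        if nxt is not None:
            samples.append({"response": msg["content"],
                            "next_state": nxt["content"]})
    return samples

conv = [
    {"role": "user", "content": "rename the file"},
    {"role": "assistant", "content": "done, renamed to notes.md"},
    {"role": "user", "content": "thanks, that's right"},
]
print(pair_trainable_turns(conv))
```

The next-state message ("thanks, that's right") is what the PRM later scores against, without any manual labeling.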

Two Learning Paradigms in One Framework

Binary RL (GRPO): A Process Reward Model scores each turn as good/bad/neutral based on the next-state feedback. The scalar reward is used with GRPO advantage estimation and PPO-style clipped surrogate loss.
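
The two pieces named here, group-normalized advantages and the clipped surrogate, can be written out as a pure-Python toy (the real trainer operates on token log-probs, and the exact normalization details are assumptions):

```python
import math

def grpo_advantages(rewards):
    """GRPO-style advantage: normalize scalar rewards within a group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mean) / std for r in rewards]

def clipped_surrogate(logp_new, logp_old, adv, eps=0.2):
    """PPO-style clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1 - eps, min(ratio, 1 + eps))
    return min(ratio * adv, clipped * adv)

print(grpo_advantages([1, -1, 0, 1]))   # zero-mean, unit-variance advantages
```

With identical policies (ratio = 1) the surrogate is just the advantage; large ratio excursions are clipped to the [1−eps, 1+eps] band.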

On-Policy Distillation (OPD): When the next state reveals useful hindsight, a judge model extracts a textual hint. This hint augments the original prompt to create an "enhanced teacher," whose token-level log-probability gap with the student becomes a directional advantage signal — richer than any scalar reward.
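
A toy rendering of that token-level signal (the clipping and function name are illustrative assumptions): the gap between the enhanced teacher's and the student's log-probs, on the same sampled response, is positive exactly where the hint made the teacher more confident in a token.

```python
def opd_token_advantages(teacher_logps, student_logps, clip=2.0):
    """Per-token directional advantage from a teacher/student log-prob gap.

    Both inputs are per-token log-probs of the SAME sampled response;
    the clip bound is a stability assumption, not from the paper.
    """
    return [max(-clip, min(clip, t - s))
            for t, s in zip(teacher_logps, student_logps)]

teacher = [-0.1, -0.5, -2.0]   # hint-augmented teacher log-probs
student = [-0.9, -0.5, -0.2]   # student log-probs on the same tokens
print([round(a, 3) for a in opd_token_advantages(teacher, student)])
# -> [0.8, 0.0, -1.8]
```

Unlike a scalar reward, the sign and magnitude vary per token: the first token is pushed up, the second left alone, the third pushed down.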

Production-Ready Engineering

  • Session-aware training: Multi-turn conversations are tracked per-session with proper turn ordering
  • Graceful weight updates: Submission pauses during model updates, then resumes — no data corruption
  • At-least-one guarantee (Binary RL): Every session contributes at least one effective training sample
  • Hint quality filtering (OPD): Only the longest, most informative hint among m votes is selected; trivial hints are discarded
  • Teacher log-prob optimization (OPD): Only response-suffix log-probs are computed to reduce peak memory
  • Record & debug: All conversations and PRM evaluations are logged to JSONL for analysis
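
For the last point, one line of JSON per record is all JSONL means; a minimal sketch of such a session-aware log record (the field names are assumptions for illustration, not OpenClaw-RL's actual format):

```python
import io
import json
import time

def log_turn(fh, session_id, turn_idx, response, prm_score):
    """Append one conversation turn as a single JSON line."""
    record = {"session": session_id, "turn": turn_idx,
              "response": response, "prm_score": prm_score,
              "ts": time.time()}
    fh.write(json.dumps(record) + "\n")

# In-memory stand-in for the JSONL file on disk.
buf = io.StringIO()
log_turn(buf, "sess-1", 0, "renamed the file", 1)
log_turn(buf, "sess-1", 1, "ran the tests", -1)
lines = buf.getvalue().splitlines()
print([json.loads(line)["prm_score"] for line in lines])  # -> [1, -1]
```

Because each line is independent JSON, the log can be tailed and analyzed while training is still running.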

🎯 Roadmap

Our long-term goal is to advance personalized, practically useful agents with reinforcement learning. The roadmap has two tracks:

Track 1 — Personal Agent Optimization (Small-Scale but Personal)

✅ Release v1: Fully async OpenClaw-RL framework with Binary RL + OPD
⬜ Broader model family support & more efficient serving
⬜ Best recipe discovery via large-scale experiments
⬜ Beyond the policy: extend learning to skills and memory

Track 2 — General Agents Optimization (Scalable Infra)

⬜ Next (2–3 weeks): Scalable agentic RL infra for general agents (computer-use first)


🔧 Quick Start

1. RL Server Environment

Prerequisites

  • Hardware: 8× GPUs (default; configurable via NUM_GPUS, ACTOR_GPUS, ROLLOUT_GPUS, PRM_GPUS)
  • Software: CUDA 12.9, Python 3.12
  • Framework: Slime (our base RL framework)

For detailed environment setup, see Slime or ./instructions/README.md.

2. Start the RL Server

We provide two RL server methods:

| Method | Signal Type | How It Works | When to Use |
|---|---|---|---|
| Binary RL | Scalar (+1/−1/0) | PRM judges response quality from next-state feedback via majority vote → GRPO | Abundant implicit feedback (likes, env success/failure) |
| On-Policy Distillation (OPD) | Token-level directional | Extract hindsight hints from next-state → construct enhanced teacher → token-level distillation | Rich textual feedback; need directional improvement |

Choose your optimization method:

Option A: Binary RL — Best for implicit feedback (likes/dislikes, env success/failure)

cd slime
bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh

The PRM will automatically judge response quality from next-state feedback. We recommend providing frequent feedback (e.g., 👍/👎) to help the model optimize effectively.

See ./openclaw-rl/README.md for algorithm details.

Option B: On-Policy Distillation (OPD) — Best for rich textual feedback

cd slime
bash ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh

The system extracts hindsight hints from your feedback and distills them into the policy at the token level. We recommend providing concrete feedback (e.g., "you should have checked the file first" or "don't use that library").

See ./openclaw-opd/README.md for algorithm details.

Once running, the model is served as an OpenAI-compatible API at:

http://<HOST_IP>:30000/v1

where <HOST_IP> is the IP address of the machine running the RL server (e.g. 115.190.98.251). Port 30000 is the default and can be changed via the PORT environment variable.

Take note of this endpoint — you will need it when configuring OpenClaw in the next step.
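
Because the endpoint is OpenAI-compatible, any standard client can reach it. A stdlib-only sketch of constructing such a request (the HOST_IP value and the API-key placeholder are illustrative; the model id must match the one you configure in OpenClaw):

```python
import json
import urllib.request

HOST_IP = "127.0.0.1"  # replace with your RL server's address

payload = {
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "hello"}],
}
req = urllib.request.Request(
    f"http://{HOST_IP}:30000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer <SGLANG_API_KEY>"},  # your server's key
)
# Once the server is up, urllib.request.urlopen(req) returns the completion.
print(req.full_url)
```

This is a quick way to sanity-check the server before wiring it into OpenClaw.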

3. OpenClaw Setup

Install OpenClaw from the version bundled in this repository (we will update it regularly).

Then configure OpenClaw to route requests to your RL server. Open your openclaw.json (or the equivalent settings file) and add a provider entry under "models" → "providers":

{
  "models": {
    "providers": {
      "qwen": {
        "baseUrl": "http://<HOST_IP>:30000/v1",
        "apiKey": "apiKey",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3-4b",
            "name": "Qwen3 4B",
            "reasoning": true,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 32768,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}

Replace <HOST_IP> with the IP address of your RL server machine. The apiKey should match the SGLANG_API_KEY you set when starting the server.

That's it — start chatting with your OpenClaw agent. The RL server will automatically collect conversation trajectories, compute rewards, and train the model. Your agent gets better the more you use it.

Configurations

Before launching, set these important environment variables as needed:

| Variable | Default | Description |
|---|---|---|
| NUM_GPUS | 8 | Total GPUs available on the machine |
| ACTOR_GPUS | 4 | GPUs allocated to the training actor |
| ROLLOUT_GPUS | 2 | GPUs allocated to rollout generation |
| PRM_GPUS | 2 | GPUs allocated to the Process Reward Model |
| HF_CKPT | (see script) | Path to the base HuggingFace checkpoint |
| PRM_MODEL_PATH | (see script) | Path to the reward model HuggingFace checkpoint |
| SAVE_CKPT | (see script) | Path to the saved HuggingFace checkpoint |
| SGLANG_API_KEY | — | API key for the SGLang serving endpoint |

See ./instructions for more configuration details.
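
For example, a launch that overrides the default GPU split might look like this (the values are illustrative; adjust to your hardware, and note the GPU counts should sum consistently with NUM_GPUS):

```shell
# Allocate 8 GPUs: 4 for training, 2 for rollout, 2 for the PRM.
export NUM_GPUS=8
export ACTOR_GPUS=4
export ROLLOUT_GPUS=2
export PRM_GPUS=2
export SGLANG_API_KEY="apiKey"   # must match the apiKey in openclaw.json

cd slime
bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh
```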

📖 Citation

@misc{wang2026openclawrl,
  author       = {Wang, Yinjie and Wang, Mengdi and Yang, Ling},
  title        = {OpenClaw-RL},
  year         = {2026},
  organization = {GitHub},
  url          = {https://github.com/Gen-Verse/OpenClaw-RL},
}

@article{yu2025demystify,
  title={Demystifying Reinforcement Learning in Agentic Reasoning},
  author={Yu, Zhaochen and Yang, Ling and Zou, Jiaru and Yan, Shuicheng and Wang, Mengdi},
  journal={arXiv preprint arXiv:2510.11701},
  year={2025}
}

@article{wang2026rlanything,
  title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
  author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2602.02488},
  year={2026}
}

πŸ™ Acknowledgements

This work aims to explore more effective paradigms for Agentic RL. Our implementation builds upon the excellent codebases of slime, OpenClaw and Open-AgentRL. We sincerely thank these projects for their valuable insights and high-quality implementations, which have greatly facilitated our research.

