@vyomakesh0728 vyomakesh0728 commented Jan 22, 2026

What does this PR do?

This PR adds full Atropos environment support to VERL's GRPO training pipeline, addressing issue #1782 from Nous Research.

What's included:

  • GRPO training that handles token-level advantages from Atropos environments when provided
  • VERL now spins up vLLM inference servers and registers them with the Atropos API
  • Policy weight updates are managed by VERL throughout training
  • Single launcher orchestrates the entire pipeline (Atropos API + vLLM + training)
  • Tested end-to-end on GSM8K with solid improvements in reward and accuracy
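To illustrate the first bullet, here is a minimal sketch of how an optional `token_level_advantages` argument could short-circuit the usual group normalization in `compute_grpo_outcome_advantage`. The signature is simplified for illustration and does not mirror verl's actual function.

```python
import torch

def compute_grpo_outcome_advantage(token_level_rewards, response_mask,
                                   index, token_level_advantages=None,
                                   epsilon=1e-6):
    """GRPO outcome advantages with an optional external override.

    If an environment (e.g. Atropos) already supplies per-token
    advantages, use them directly instead of recomputing them from
    group-normalized outcome rewards.
    """
    if token_level_advantages is not None:
        # Trust the environment-provided advantages; only mask padding.
        adv = token_level_advantages * response_mask
        return adv, adv

    # Standard GRPO: normalize sequence-level rewards within each
    # prompt group, identified by `index`.
    scores = token_level_rewards.sum(dim=-1)  # (bsz,)
    adv = torch.empty_like(scores)
    for group in set(index):
        mask = torch.tensor([i == group for i in index])
        group_scores = scores[mask]
        adv[mask] = (group_scores - group_scores.mean()) / \
                    (group_scores.std() + epsilon)

    # Broadcast the sequence-level advantage to every response token.
    adv = adv.unsqueeze(-1) * response_mask
    return adv, adv
```

The override path keeps the trainer's downstream PPO loss unchanged: it consumes a `(batch, seq_len)` advantage tensor either way.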

Core verl changes:

  • Integration documentation: Complete API reference and usage examples
  • GRPO advantage override support: (verl/trainer/ppo/core_algos.py) - Added token_level_advantages parameter to compute_grpo_outcome_advantage() for Atropos-provided token-level advantages
  • Multi-turn GRPO handling: (verl/trainer/ppo/ray_trainer.py) - Support for token-level advantages in compute_advantage() and multi-turn conversation masking
  • vLLM max_model_len configuration: (verl/workers/rollout/vllm_rollout/vllm_async_server.py) - Respect explicit overrides for KV cache memory control
  • Config template: (verl/trainer/config/atropos_grpo_small.yaml) - Reference GRPO configuration for Atropos integration
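To make the reference configuration concrete, a fragment along these lines could express the GRPO estimator choice and the explicit `max_model_len` override described above. The `atropos` section and its key names are illustrative guesses, not the actual contents of `atropos_grpo_small.yaml`.

```yaml
# Illustrative fragment only — key names beyond standard verl GRPO
# settings are assumptions about what atropos_grpo_small.yaml contains.
algorithm:
  adv_estimator: grpo            # GRPO advantage estimation

actor_rollout_ref:
  rollout:
    name: vllm
    max_model_len: 4096          # explicit override, respected for KV-cache sizing

atropos:
  api_url: http://localhost:8000  # Atropos API server (hypothetical key)
```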

Main files under recipe:

  • atropos/atropos_integration.py - Atropos API client and advantage handling
  • atropos/grpo_atropos_trainer.py - GRPO trainer with token-level advantage support
  • atropos/launch_atropos_verl_services.py - Service orchestration
  • Complete docs and example configs

Atropos recipe PR Link

Test

Tested end-to-end with atropos_grpo_small.yaml on the GSM8K dataset on a single A100, for 372 steps:

W&B run: qwen2_5_3b_grpo_atropos_verl_small

```shell
uv run recipe/atropos/launch_atropos_verl_services.py \
  --config recipe/atropos/config/atropos_grpo_small.yaml
```
  • Training shows steady improvements in reward (val-aux/openai/gsm8k/reward/mean@1) and accuracy (val-core/openai/gsm8k/acc/mean@1) over 372 steps
  • All stability metrics (KL divergence, entropy, gradient norm) remained bounded throughout training
  • Full W&B run available at wandb

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Closes #1782

@vyomakesh0728 changed the title to "[feat] Atropos integration with GRPO (#1782)" on Jan 22, 2026
@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request integrates Atropos with VERL's GRPO training pipeline, introducing support for token-level advantages from Atropos environments. This is a significant feature addition that allows for more flexible advantage calculation. The changes are well-contained, primarily affecting the PPO trainer logic to handle these external advantages and multi-turn conversations. The inclusion of documentation for the new API and integration examples is a great addition for maintainability and usability. I've identified one critical issue with a dependency version in pyproject.toml that needs to be addressed to ensure correct installation.


Merging this pull request may close: Atropos integration ($2500 bounty) #1782
