Awesome code on reinforcement learning for LLM reasoning, gathered from around the beautiful world 🤯 We are not here to judge the performance of the various methods; we are here to appreciate the beauty in diversity.
ReFT: Reasoning with Reinforced Fine-Tuning (2401.08967)
Tulu 3: Pushing Frontiers in Open Language Model Post-Training (2411.15124)
PRIME (Process Reinforcement through IMplicit REwards), an open-source solution for online RL with process rewards. This work stems from the implicit process reward modeling (PRM) objective. Built upon veRL.
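The implicit-PRM idea behind PRIME is that once a reward model is trained on outcome labels with the implicit objective, token-level process rewards fall out for free as scaled log-likelihood ratios between that model and a frozen reference model, roughly r_t = β(log π_φ(y_t | y_<t) − log π_ref(y_t | y_<t)), with no per-step labels. A minimal sketch with plain Python lists (function and argument names are illustrative, not PRIME's actual API):

```python
def implicit_process_rewards(logp_model, logp_ref, beta=0.05):
    """Token-level implicit process rewards.

    logp_model: per-token log-probs of the response under the implicit PRM
    logp_ref:   per-token log-probs under the frozen reference model
    beta:       scaling coefficient (a hyperparameter; 0.05 is illustrative)

    Each token's reward is beta * (log-likelihood ratio); summing over the
    sequence recovers the outcome-level reward the PRM was trained on.
    """
    return [beta * (lm - lr) for lm, lr in zip(logp_model, logp_ref)]
```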
TinyZero is a reproduction of DeepSeek R1 Zero on the countdown and multiplication tasks. Built upon veRL.
(Mini-R1: Philipp reproduced the R1 "aha moment" on countdown as well. Built upon TRL.)
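Countdown works well for these R1-Zero reproductions because the reward is rule-based and verifiable: the model's proposed equation must use exactly the given numbers and evaluate to the target. A minimal sketch of such a reward function, assuming a binary 0/1 reward (names are illustrative, not taken from either repo):

```python
import ast
import operator

# Allowed binary operators for safely evaluating arithmetic expressions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(node):
    """Recursively evaluate an arithmetic AST (numbers and + - * / only)."""
    if isinstance(node, ast.Expression):
        return safe_eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](safe_eval(node.left), safe_eval(node.right))
    raise ValueError("disallowed expression")

def countdown_reward(equation: str, numbers: list, target: int) -> float:
    """1.0 if `equation` uses exactly `numbers` and evaluates to `target`, else 0.0."""
    try:
        tree = ast.parse(equation, mode="eval")
        used = sorted(n.value for n in ast.walk(tree)
                      if isinstance(n, ast.Constant))
        if used != sorted(numbers):
            return 0.0
        return 1.0 if abs(safe_eval(tree) - target) < 1e-6 else 0.0
    except Exception:
        # Unparseable or disallowed output gets zero reward.
        return 0.0
```

Actual implementations typically also parse the equation out of the model's `<answer>` tags and may add a small partial reward for well-formed formatting.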
A fully open reproduction of DeepSeek-R1.🤗
simpleRL-reason reproduces the training of DeepSeek-R1-Zero and DeepSeek-R1 for complex mathematical reasoning, starting from Qwen-2.5-Math-7B (base model) and using only 8K (query, final answer) examples from the original MATH dataset. Built upon OpenRLHF.
Applies RL to DeepSeek-R1-Distill-Qwen-1.5B with 30k examples (from MATH, NuminaMath-CoT, and AIME 1983-2023). Built upon OpenRLHF.
RAGEN is a reproduction of the DeepSeek-R1(-Zero) methods for training agentic models. They run RAGEN on Qwen-2.5-{0.5B, 3B}-{Instruct, base} and DeepSeek-R1-Distill-Qwen-1.5B on the Gym-Sokoban task.📦 Built upon veRL.
- math
  - RLVR-GSM (train: 7.47k; test: 1.32k)
  - RLVR-MATH (train: 7.5k)
  - NuminaMath-CoT (aops_forum+amc_aime+cn_k12+gsm8k+math+olympiads+orca_math+synthetic_amc+synthetic_math) (train: 859k; test: 100)
- code
  - code_contests (train: 3.76k; val: 117; test: 165)
  - TACO (train: 25k; test: 1k)
- others
  - RLVR-IFeval (train: 15k)
- mix
  - Eurus-2-RL-Data (NuminaMath-CoT+APPS+CodeContests+TACO+Codeforces, with cleaning and filtering) (train: 481k; val: 2k)
- ...
- NuminaMath-QwQ-CoT-5M
- Bespoke-Stratos-17k
- R1-Distill-SFT
- dolphin-r1
- OpenThoughts-114k
- SCP-116K
- Magpie-Reasoning-V1-150K-CoT-QwQ
- Magpie-Reasoning-V1-150K-CoT-Deepseek-R1-Llama-70B
- function-calling-v0.2-with-r1-cot
- s1K (it seems to include samples from some eval sets, such as OmniMath ???🤯)