Awesome code on reinforcement learning for LLM reasoning, gathered from around the beautiful world 🤯 We are not here to judge the performance of the various methods; we are here to appreciate the beauty in diversity.
ReFT: Reasoning with Reinforced Fine-Tuning (2401.08967)
Tulu 3: Pushing Frontiers in Open Language Model Post-Training (2411.15124)
PRIME (Process Reinforcement through IMplicit REwards), an open-source solution for online RL with process rewards. This work stems from the implicit process reward modeling (PRM) objective. Built upon veRL.
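The implicit-PRM idea behind PRIME is that once a reward model is trained on outcome labels with the implicit objective, token-level process rewards fall out for free as scaled log-likelihood ratios between that model and a frozen reference model, roughly r_t = β(log π_φ(y_t | y_<t) − log π_ref(y_t | y_<t)), with no per-step labels. A minimal sketch with plain Python lists (function and argument names are illustrative, not PRIME's actual API):

```python
def implicit_process_rewards(logp_model, logp_ref, beta=0.05):
    """Token-level implicit process rewards.

    logp_model: per-token log-probs of the response under the implicit PRM
    logp_ref:   per-token log-probs under the frozen reference model
    beta:       scaling coefficient (a hyperparameter; 0.05 is illustrative)

    Each token's reward is beta * (log-likelihood ratio); summing over the
    sequence recovers the outcome-level reward the PRM was trained on.
    """
    return [beta * (lm - lr) for lm, lr in zip(logp_model, logp_ref)]
```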
TinyZero is a reproduction of DeepSeek R1 Zero on the countdown and multiplication tasks. Built upon veRL.
(Mini-R1: Philipp reproduced the R1 "aha moment" on countdown as well. Built upon TRL.)
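Countdown works well for these R1-Zero reproductions because the reward is rule-based and verifiable: the model's proposed equation must use exactly the given numbers and evaluate to the target. A minimal sketch of such a reward function, assuming a binary 0/1 reward (names are illustrative, not taken from either repo):

```python
import ast
import operator

# Allowed binary operators for safely evaluating arithmetic expressions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(node):
    """Recursively evaluate an arithmetic AST (numbers and + - * / only)."""
    if isinstance(node, ast.Expression):
        return safe_eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](safe_eval(node.left), safe_eval(node.right))
    raise ValueError("disallowed expression")

def countdown_reward(equation: str, numbers: list, target: int) -> float:
    """1.0 if `equation` uses exactly `numbers` and evaluates to `target`, else 0.0."""
    try:
        tree = ast.parse(equation, mode="eval")
        used = sorted(n.value for n in ast.walk(tree)
                      if isinstance(n, ast.Constant))
        if used != sorted(numbers):
            return 0.0
        return 1.0 if abs(safe_eval(tree) - target) < 1e-6 else 0.0
    except Exception:
        # Unparseable or disallowed output gets zero reward.
        return 0.0
```

Actual implementations typically also parse the equation out of the model's `<answer>` tags and may add a small partial reward for well-formed formatting.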
A fully open reproduction of DeepSeek-R1.🤗
simpleRL-reason reproduces the training of DeepSeek-R1-Zero and DeepSeek-R1 for complex mathematical reasoning, starting from Qwen-2.5-Math-7B (base model) and using only 8K (query, final answer) examples from the original MATH dataset. Built upon OpenRLHF.
Applies RL to DeepSeek-R1-Distill-Qwen-1.5B with 30k examples (from MATH, NuminaMath-CoT, and AIME 1983-2023). Built upon OpenRLHF.
RAGEN is a reproduction of the DeepSeek-R1(-Zero) methods for training agentic models. They run RAGEN on Qwen-2.5-{0.5B, 3B}-{Instruct, base} and DeepSeek-R1-Distill-Qwen-1.5B on the Gym-Sokoban task.📦 Built upon veRL.
- math
  - RLVR-GSM (train: 7.47k; test: 1.32k)
  - RLVR-MATH (train: 7.5k)
  - NuminaMath-CoT (aops_forum+amc_aime+cn_k12+gsm8k+math+olympiads+orca_math+synthetic_amc+synthetic_math) (train: 859k; test: 100)
- code
  - code_contests (train: 3.76k; val: 117; test: 165)
  - TACO (train: 25k; test: 1k)
- others
  - RLVR-IFeval (train: 15k)
- mix
  - Eurus-2-RL-Data (NuminaMath-CoT+APPS+CodeContests+TACO+Codeforces, with cleaning and filtering) (train: 481k; val: 2k)
- ...
- NuminaMath-QwQ-CoT-5M
- Bespoke-Stratos-17k
- R1-Distill-SFT
- dolphin-r1
- OpenThoughts-114k
- SCP-116K
- Magpie-Reasoning-V1-150K-CoT-QwQ
- Magpie-Reasoning-V1-150K-CoT-Deepseek-R1-Llama-70B
- function-calling-v0.2-with-r1-cot
- s1K (it seems to include samples from some eval sets, such as OmniMath ???🤯)