
AwesomeCode_on_LLMReasoningRL

Awesome code on reinforcement learning for LLM reasoning, from the beautiful world 🤯 We are not here to judge the performance of all kinds of methods; we are here to appreciate the beauty in diversity.


ReFT: Reasoning with Reinforced Fine-Tuning (2401.08967)



Tulu 3: Pushing Frontiers in Open Language Model Post-Training (2411.15124)



PRIME (Process Reinforcement through IMplicit REwards) is an open-source solution for online RL with process rewards; the work stems from the implicit process reward modeling (PRM) objective. Built upon veRL.

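For intuition only (a minimal sketch, not PRIME's actual code): under the implicit PRM objective, per-token process rewards fall out as a log-probability ratio between the implicit PRM and a frozen reference model, so no step-level labels are needed.

```python
# Minimal sketch of implicit process rewards (illustrative, not PRIME's implementation):
# per-token reward r_t = beta * (log pi_phi(y_t | ctx) - log pi_ref(y_t | ctx)),
# where pi_phi is the implicit PRM (trained only on outcome labels) and pi_ref is frozen.
import torch
import torch.nn.functional as F

def implicit_process_rewards(prm_logits, ref_logits, response_ids, beta=0.05):
    """prm_logits / ref_logits: [T, vocab] logits over the response tokens;
    response_ids: [T] sampled response token ids. Returns [T] per-token rewards.
    beta is an illustrative value, not a recommended setting."""
    prm_logp = F.log_softmax(prm_logits, dim=-1).gather(-1, response_ids[:, None]).squeeze(-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1).gather(-1, response_ids[:, None]).squeeze(-1)
    return beta * (prm_logp - ref_logp)

# Toy usage with random logits standing in for the two models' outputs.
torch.manual_seed(0)
T, vocab = 8, 32
rewards = implicit_process_rewards(torch.randn(T, vocab), torch.randn(T, vocab),
                                   torch.randint(vocab, (T,)))
print(rewards)  # dense token-level rewards, combined with the outcome reward in PRIME
```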


TinyZero is a reproduction of DeepSeek-R1-Zero on the countdown and multiplication tasks. Built upon veRL.

(Mini-R1: Philipp reproduced the R1 "aha moment" on countdown as well. Built upon trl.)


open-r1: a fully open reproduction of DeepSeek-R1. 🤗


simpleRL-reason reproduces the training of DeepSeek-R1-Zero and DeepSeek-R1 for complex mathematical reasoning, starting from Qwen-2.5-Math-7B (the base model) and using only 8K (query, final answer) examples from the original MATH dataset. Built upon OpenRLHF.


Applies RL to DeepSeek-R1-Distill-Qwen-1.5B with 30k examples (from MATH, NuminaMath-CoT, and AIME 1983-2023). Built upon OpenRLHF.


RAGEN is a reproduction of the DeepSeek-R1(-Zero) methods for training agentic models. The authors run RAGEN on Qwen-2.5-{0.5B, 3B}-{Instruct, base} and DeepSeek-R1-Distill-Qwen-1.5B on the Gym-Sokoban task. 📦 Built upon veRL.



open-r1-multimodal


verifier
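For intuition, a rule-based verifier for math-style answers can be as small as the sketch below (an illustrative example only; real verifiers do far more normalization, e.g. LaTeX parsing and numeric tolerance):

```python
# Simplified, illustrative verifier for (query, final answer) math data:
# extract the model's final boxed/labeled answer and compare it to the gold answer.
import re

def extract_final_answer(text: str):
    # Prefer the last "\boxed{...}", then fall back to a trailing "Answer: ..." line.
    m = re.findall(r"\\boxed\{([^{}]*)\}", text)
    if m:
        return m[-1].strip()
    m = re.findall(r"[Aa]nswer\s*[:=]\s*(.+)", text)
    return m[-1].strip() if m else None

def verify(completion: str, gold: str) -> float:
    # Binary outcome reward: 1.0 if the normalized answers match, else 0.0.
    pred = extract_final_answer(completion)
    norm = lambda s: s.replace(" ", "").replace(",", "").rstrip(".")
    return 1.0 if pred is not None and norm(pred) == norm(gold) else 0.0

print(verify("... so the result is \\boxed{42}.", "42"))  # 1.0
```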

rl framework
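As one concrete entry point (a sketch assuming a recent trl release that includes GRPOTrainer; veRL and OpenRLHF expose different, framework-specific APIs), a minimal GRPO run wires a prompt dataset and a reward function into the trainer:

```python
# Sketch of a GRPO run with trl's GRPOTrainer (assumes a recent trl version).
# The reward function receives sampled completions and returns one score each;
# plugging in a verifier like the one above turns this into RLVR-style training.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer ~200-character completions; swap in a real verifier here.
    return [-abs(200 - len(c)) / 200.0 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # any dataset with a "prompt" column

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-sketch", logging_steps=10),
    train_dataset=dataset,
)
trainer.train()
```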

data (any task with a ratable/verifiable answer can be used; see the loading sketch after this list)

  • math
    • RLVR-GSM (train:7.47k; test:1.32k)
    • RLVR-MATH (train:7.5k)
    • NuminaMath-CoT (aops_forum+amc_aime+cn_k12+gsm8k+math+olympiads+orca_math+synthetic_amc+synthetic_math) (train:859k; test:100)
  • code
  • others
  • mix
    • Eurus-2-RL-Data (NuminaMath-CoT+APPS+CodeContests+TACO+Codeforces+cleaning and filtering) (train:481k; val:2k)
  • ...
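For example (a sketch assuming the Tulu 3 RLVR split is hosted on the Hugging Face Hub as allenai/RLVR-GSM; column names are not guaranteed, inspect them before relying on them), such (query, final answer) data can be pulled straight into an RL pipeline:

```python
# Sketch of loading one of the listed RLVR datasets from the Hugging Face Hub.
from datasets import load_dataset

gsm = load_dataset("allenai/RLVR-GSM", split="train")
print(len(gsm), gsm.column_names)

example = gsm[0]
print(example)  # expect a prompt/messages field plus a verifiable ground-truth answer
```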

message (SFT) data distilled from long-CoT models (R1, QwQ, ...)
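A minimal collection loop for such data might look like the following (a sketch; the endpoint, model name, and sampling settings are placeholders for whatever long-CoT model you serve):

```python
# Sketch of collecting "message" SFT data from a long-CoT model served behind an
# OpenAI-compatible API (e.g. a local vLLM server); base_url and model are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def distill(question: str, model: str = "deepseek-r1") -> dict:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0.6,
    )
    # Store the full long-CoT response as an assistant turn for later SFT / rejection sampling.
    return {"messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": resp.choices[0].message.content},
    ]}

with open("longcot_msgs.jsonl", "w") as f:
    f.write(json.dumps(distill("What is 17 * 23?")) + "\n")
```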

others
