S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.


This is the official implementation of the following paper:

S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [Paper]

The overview of S²R.

1. Environment Setup 🔧

pip install -r requirements.txt

Note: Different models and training stages require specific transformers versions (a quick version check is sketched after the list):

  • Qwen2-7B-Instruct & Qwen2.5-Math-7B SFT: transformers 4.39.3
  • Llama-3.1-8B-Instruct SFT: transformers 4.44.3
  • RL: transformers 4.46.3
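For example, a minimal sanity check (a sketch, run inside the environment you are about to train in) can confirm that the installed transformers version matches the stage:

```python
import transformers

# Stage-specific versions listed above; adjust if the requirements change.
EXPECTED = {
    "qwen_sft": "4.39.3",   # Qwen2-7B-Instruct & Qwen2.5-Math-7B SFT
    "llama_sft": "4.44.3",  # Llama-3.1-8B-Instruct SFT
    "rl": "4.46.3",         # RL
}

stage = "rl"  # pick the stage you are about to run
installed = transformers.__version__
if installed != EXPECTED[stage]:
    raise RuntimeError(
        f"transformers {installed} is installed, but {EXPECTED[stage]} "
        f"is expected for the '{stage}' stage"
    )
print(f"transformers {installed} matches the '{stage}' requirement")
```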

2. Data Collection 📚

Download the original [MATH500] and [GSM8K] datasets.

Serve the model with vLLM:

python -m vllm.entrypoints.api_server  --model Qwen/Qwen2.5-Math-7B --port 8081  --tensor-parallel-size 4
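Once the server is up, a quick smoke test can confirm the model is reachable on port 8081. This is a sketch: the payload fields follow vLLM's demo api_server /generate endpoint and may differ across vLLM versions.

```python
import requests

# POST a toy prompt to the demo api_server started above.
resp = requests.post(
    "http://localhost:8081/generate",
    json={"prompt": "Compute 2 + 2.", "max_tokens": 64, "temperature": 0.0},
    timeout=60,
)
resp.raise_for_status()
# The demo server returns {"text": [prompt + completion, ...]}.
print(resp.json()["text"][0])
```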

Configure the model and data paths in ./scripts/collect_data.sh, then run:

sh ./scripts/collect_data.sh

We also provide the SFT and RL data in data/train_data/sft_qwen2.5_math_7B.json and data/train_data/rl_data_qwen2.5.jsonl for reproducing the results of Qwen2.5-Math-7B reported in our paper.
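A quick way to inspect these files (a sketch, assuming the SFT file is a JSON list and the RL file holds one JSON object per line, as the extensions suggest):

```python
import json

# Peek at the provided training files and print their record keys,
# rather than assuming a schema.
with open("data/train_data/sft_qwen2.5_math_7B.json") as f:
    sft = json.load(f)
print(len(sft), "SFT examples; keys:", sorted(sft[0].keys()))

with open("data/train_data/rl_data_qwen2.5.jsonl") as f:
    rl = [json.loads(line) for line in f if line.strip()]
print(len(rl), "RL examples; keys:", sorted(rl[0].keys()))
```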

Our datasets are now also available via Hugging Face!

3. SFT Training 🔥

In ./code/scripts, configure your data and model paths, then run:

sh ./code/scripts/train_sft.sh

4. Online RL Training 🚀

Configure your data and model path, then run:

sh ./code/scripts/train_rl.sh

Use the following config for outcome-level training:

--use_instance_level True
--kl_coef 0.01   # for Qwen2.5-Math-7B
--rl_data_path ./data/train_data/rl_data_qwen2.5.jsonl  # for Qwen2.5-Math-7B

Use the following config for process-level training:

--use_instance_level False
--kl_coef 0.05
--rl_data_path ./data/train_data/rl_data_qwen2.5.jsonl  # for Qwen2.5-Math-7B

5. Offline RL Training 💼

Rejection Sampling and Prompt Filtering

For offline sampling of rollouts, specify the prompt dataset (containing problems and answers), the model path, and the storage path, then run:

sh ./sample/sample_all.sh

The format of the prompt dataset should follow the reference: ./data/train_data/rl_data_offline.jsonl
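As a sketch of building your own prompt file: assuming each record carries a problem and its answer, the key names "problem" and "answer" below are placeholders, so check ./data/train_data/rl_data_offline.jsonl for the exact schema.

```python
import json

# Hypothetical prompt records; match the keys to rl_data_offline.jsonl.
records = [
    {"problem": "What is 12 * 12?", "answer": "144"},
    {"problem": "Evaluate 7 + 8.", "answer": "15"},
]
with open("my_offline_prompts.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```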

Configure your data path, then run the following for rejection sampling and prompt filtering:

sh ./scripts/process_offline_trainset.sh

Training Script

Configure your data and model path, then run:

sh ./scripts/train_offline_rl.sh

6. Evaluation

Please refer to ./tools/qwen_eval/eval/README.md

🌟 Cite

@article{ma2025s,
  title={S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning},
  author={Ma, Ruotian and Wang, Peisong and Liu, Cheng and Liu, Xingyan and Chen, Jiaqi and Zhang, Bang and Zhou, Xin and Du, Nan and Li, Jia},
  journal={arXiv preprint arXiv:2502.12853},
  year={2025}
}

Acknowledgement

The code refers to huggingface/trl. The evaluation toolkit is built on QwenLM/Qwen2.5-Math. Thanks for their wonderful work.
