This is the official implementation of the following paper:
S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [Paper]
The overview of S²R.
pip install -r requirements.txt
Note: Different models require specific transformers versions:
- Qwen2-7B-Instruct & Qwen2.5-Math-7B SFT: transformers 4.39.3
- Llama-3.1-8B-Instruct SFT: transformers 4.44.3
- RL: transformers 4.46.3
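For example, before launching a given stage you can pin the matching version first (a minimal sketch; choose the line that corresponds to the model and stage you are about to run):

```bash
# Pin the transformers version required by the stage you are running:
pip install transformers==4.39.3    # Qwen2-7B-Instruct / Qwen2.5-Math-7B SFT
# pip install transformers==4.44.3  # Llama-3.1-8B-Instruct SFT
# pip install transformers==4.46.3  # RL
```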
Download the original [MATH500] and [GSM8K] datasets.
python -m vllm.entrypoints.api_server --model Qwen/Qwen2.5-Math-7B --port 8081 --tensor-parallel-size 4
Configure the model and data paths in ./scripts/collect_data.sh, then run:
sh ./scripts/collect_data.sh
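Before running the collection script, you can sanity-check that the deployed server responds. The sketch below assumes vLLM's demo API server (the `vllm.entrypoints.api_server` started above) exposes a POST `/generate` endpoint on the chosen port; the exact payload fields may vary across vLLM versions, and the example prompt is arbitrary:

```bash
# Send a single test request to the vLLM server started above.
curl -s http://localhost:8081/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Question: What is 3 + 5?\nAnswer:", "max_tokens": 256, "temperature": 0.0}'
```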
We also provide the SFT data in data/train_data/sft_qwen2.5_math_7B.json
and the RL data in data/train_data/rl_data_qwen2.5.jsonl
for reproducing the results of Qwen2.5-Math-7B reported in our paper.
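Before training, it can help to take a quick look at the released files. A small sketch using standard command-line tools (the record fields are simply whatever the released files contain):

```bash
# Count the RL prompts and pretty-print one record to see its structure.
wc -l data/train_data/rl_data_qwen2.5.jsonl
head -n 1 data/train_data/rl_data_qwen2.5.jsonl | python -m json.tool
```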
Our datasets are now also available on Hugging Face!
In ./code/scripts, configure your data and model paths, then run:
sh ./code/scripts/train_sft.sh
Configure your data and model paths, then run:
sh ./code/scripts/train_rl.sh
Use the following config for outcome-level training:
--use_instance_level True
--kl_coef 0.01 # for Qwen2.5-Math-7B
--rl_data_path ./data/train_data/rl_data_qwen2.5.jsonl # for Qwen2.5-Math-7B
Use the following config for process-level training:
--use_instance_level False
--kl_coef 0.05
--rl_data_path ./data/train_data/rl_data_qwen2.5.jsonl # for Qwen2.5-Math-7B
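For reference, a hypothetical sketch of how these options could be combined in the training command that ./code/scripts/train_rl.sh launches (the entry-point name train_rl.py and any omitted arguments are assumptions; check the script for the actual command):

```bash
# Hypothetical launcher invocation (outcome-level configuration shown);
# for process-level training switch to --use_instance_level False --kl_coef 0.05.
python train_rl.py \
  --use_instance_level True \
  --kl_coef 0.01 \
  --rl_data_path ./data/train_data/rl_data_qwen2.5.jsonl
```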
For offline sampling of rollouts, run the following script, specifying the prompt dataset (containing problems and answers), the model path, and the storage path:
sh ./sample/sample_all.sh
The prompt dataset should follow the format of the reference file:
./data/train_data/rl_data_offline.jsonl
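If you assemble your own prompt file, one way to check that every record parses and carries the expected fields is the sketch below; the field names `problem` and `answer` are assumptions based on the description above (compare against the reference file), `my_prompts.jsonl` is a placeholder, and `jq` must be installed:

```bash
# Prints "true" only if every line is valid JSON containing both assumed fields.
jq -s 'map(has("problem") and has("answer")) | all' my_prompts.jsonl
```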
Configure your data path, then run the following script for rejection sampling and prompt filtering:
sh ./scripts/process_offline_trainset.sh
Configure your data and model paths, then run:
sh ./scripts/train_offline_rl.sh
Please refer to ./tools/qwen_eval/eval/README.md for details on evaluation.
@article{ma2025s,
title={S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning},
author={Ma, Ruotian and Wang, Peisong and Liu, Cheng and Liu, Xingyan and Chen, Jiaqi and Zhang, Bang and Zhou, Xin and Du, Nan and Li, Jia},
journal={arXiv preprint arXiv:2502.12853},
year={2025}
}
Our code is built on huggingface/trl. The evaluation toolkit is built on QwenLM/Qwen2.5-Math. Thanks for their wonderful work.