Implementation of "Reflective Policy Optimization: Enhancing Reasoning in Large Language Models via Error Localization and Test-Time Self-Correction".
This repository contains the implementation of ReFlective Policy Optimization (RFPO), a novel reinforcement learning framework designed to enhance both the problem-solving and self-reflection capabilities of large language models (LLMs). RFPO introduces structured self-critique and targeted regeneration into the training loop, enabling LLMs to identify and correct their own reasoning flaws, especially in complex mathematical and logical tasks.
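At a high level, each training step lets the model attempt a problem, critique its own attempt, and try again before any policy update. The sketch below is a minimal illustration of that loop as we read the description above; every name in it is hypothetical and is not the repository's actual API:

```python
# Hypothetical sketch of one RFPO-style update: solve, self-critique,
# regenerate, then reinforce. None of these names come from this codebase.

def rfpo_step(policy, prompt, reward_fn):
    # 1) Sample an initial solution from the current policy.
    draft = policy.generate(prompt)

    # 2) Structured self-critique: the model localizes flawed reasoning steps.
    critique = policy.generate(
        f"{prompt}\n\nProposed solution:\n{draft}\n\nIdentify any incorrect steps:"
    )

    # 3) Targeted regeneration: re-solve conditioned on the critique.
    revised = policy.generate(
        f"{prompt}\n\nCritique of a previous attempt:\n{critique}\n\nRevised solution:"
    )

    # 4) Score both attempts and apply a policy-gradient update (e.g., GRPO-style).
    policy.update(
        prompt=prompt,
        responses=[draft, revised],
        rewards=[reward_fn(draft), reward_fn(revised)],
    )
```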
To install the required dependencies for training, run:
cd RFPO
pip install -r requirements.txt
To train a model using RFPO, follow the steps below:
- Modify the training script at:
/RFPO/examples/grpo_trainer/run_qwen2.5-7b.sh
Update the following parameters according to your environment (an illustrative example follows these steps):
  - model.path: Path to the base model (e.g., Qwen2.5-7B-Instruct)
  - train_batch_size: Training batch size (e.g., 16)
  - num_gpus: Number of GPUs used (e.g., 8)
  - default_local_dir: Output directory to store checkpoints
- Launch training:
bash /RFPO/examples/grpo_trainer/run_qwen2.5-7b.sh
Model checkpoints will be saved to the specified default_local_dir.
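For reference, the parameter block inside run_qwen2.5-7b.sh will look roughly like the sketch below. The repository layout suggests a verl-style Hydra launcher, but the entry point and key prefixes here are assumptions, not taken from the actual script; match whatever names the script itself uses.

```bash
# Illustrative overrides only -- entry point and key prefixes are assumed;
# edit the corresponding values in the actual script.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.path=/path/to/Qwen2.5-7B-Instruct \
    data.train_batch_size=16 \
    trainer.n_gpus_per_node=8 \
    trainer.default_local_dir=/path/to/checkpoints
```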
To evaluate the trained RFPO model, we recommend setting up a separate evaluation environment to avoid conflicts with training dependencies:
cd /RFPO/tests
pip install -r requirements.txt
Then merge the trained actor checkpoint into a HuggingFace-format model:
cd /RFPO/scripts
python model_merger.py --local_dir default_local_dir/global_step_num/actor
Replace default_local_dir/global_step_num with the actual path of your checkpoint.
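After merging, it can be worth sanity-checking that the result loads as a standard HuggingFace model before running the full evaluation. The path below is a placeholder for wherever model_merger.py wrote the merged weights:

```python
# Quick smoke test of the merged checkpoint (path is a placeholder).
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "/path/to/merged_model"  # hypothetical merged output location
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype="auto", device_map="auto")

prompt = "Solve step by step: what is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```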
Edit the file:
/RFPO/tests/tools/scripts/evaluate.sh
Set the following parameters:
  - model_path: Path to the merged HuggingFace-format model
  - output_dir: Directory where evaluation results will be saved
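For example (illustrative values only; whether these are set as shell variables or passed as flags depends on the script, so mirror its existing style):

```bash
# Inside evaluate.sh -- illustrative values only
model_path=/path/to/merged_model
output_dir=/path/to/eval_results
```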
cd /RFPO/tests/tools/scripts
bash evaluate.sh
The test results will be saved to the specified output_dir.
For more details on the RFPO framework, including algorithmic insights and benchmark results, please refer to our paper.
