
Reflective Policy Optimization (RFPO)

Implementation of "Reflective Policy Optimization: Enhancing Reasoning in Large Language Models via Error Localization and Test-Time Self-Correction".

This repository contains the implementation of ReFlective Policy Optimization (RFPO), a novel reinforcement learning framework designed to enhance both the problem-solving and self-reflection capabilities of large language models (LLMs). RFPO introduces structured self-critique and targeted regeneration into the training loop, enabling LLMs to identify and correct their own reasoning flaws, especially in complex mathematical and logical tasks.

Requirements

To install the required dependencies for training, run:

cd RFPO
pip install -r requirements.txt

Training

To train a model using RFPO, follow the steps below:

  1. Modify the training script at:

/RFPO/examples/grpo_trainer/run_qwen2.5-7b.sh

Update the following parameters according to your environment:

  • model.path: Path to the base model (e.g., Qwen2.5-7B-Instruct)
  • train_batch_size: Training batch size (e.g., 16)
  • num_gpus: Number of GPUs used (e.g., 8)
  • default_local_dir: Output directory to store checkpoints
  2. Launch training:

bash /RFPO/examples/grpo_trainer/run_qwen2.5-7b.sh

Model checkpoints will be saved to the specified default_local_dir.
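As a concrete reference, the edits in step 1 might look like the sketch below. The variable names and values here are illustrative placeholders only; the script may use dotted override keys such as model.path directly, so match the names to what run_qwen2.5-7b.sh actually contains.

```shell
# Illustrative stand-ins for the parameters listed above -- adapt the
# names and values to what run_qwen2.5-7b.sh actually uses.
MODEL_PATH=/models/Qwen2.5-7B-Instruct    # model.path: base model
TRAIN_BATCH_SIZE=16                       # train_batch_size
NUM_GPUS=8                                # num_gpus
DEFAULT_LOCAL_DIR=/data/rfpo_checkpoints  # default_local_dir: checkpoint output
```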

Evaluation

To evaluate the trained RFPO model, we recommend setting up a separate evaluation environment to avoid conflicts with training dependencies:

1. Set up evaluation environment:

cd /RFPO/tests
pip install -r requirements.txt

2. Merge the FSDP checkpoint into a deployable HuggingFace model format:

cd /RFPO/scripts
python model_merger.py --local_dir default_local_dir/global_step_num/actor

Replace default_local_dir/global_step_num with the actual path of your checkpoint.
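If training produced several checkpoints, a small helper like the following can locate the newest global_step_* directory and print the corresponding merge command. The LOCAL_DIR value is a hypothetical stand-in for your default_local_dir:

```shell
# Pick the newest global_step_* checkpoint by version-aware sort and
# print the merge command to run from /RFPO/scripts.
LOCAL_DIR=checkpoints/rfpo-qwen2.5-7b   # placeholder for your default_local_dir
CKPT=$(ls -d "$LOCAL_DIR"/global_step_* 2>/dev/null | sort -V | tail -n 1)
echo "python model_merger.py --local_dir $CKPT/actor"
```

Plain lexicographic sorting would rank global_step_90 after global_step_100, which is why `sort -V` (version sort) is used here.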

3. Configure the evaluation script:

Edit the file:

/RFPO/tests/tools/scripts/evaluate.sh

Set the following parameters:

  • model_path: Path to the merged HuggingFace-format model
  • output_dir: Directory where evaluation results will be saved
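For reference, the two settings inside evaluate.sh might end up looking like this sketch. Both paths are hypothetical examples, and the exact subdirectory holding the merged model depends on where model_merger.py wrote its output:

```shell
# Hypothetical paths -- substitute your own merged model and results dirs.
model_path=/data/rfpo_checkpoints/global_step_300/actor/huggingface  # merged HuggingFace-format model
output_dir=/data/rfpo_eval_results                                   # where results are written
```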

4. Run evaluation:

cd /RFPO/tests/tools/scripts
bash evaluate.sh

The test results will be saved to the specified output_dir.


For more details on the RFPO framework, including algorithmic insights and benchmark results, please refer to our paper.
