Implementation of "Reflective Policy Optimization: Enhancing Reasoning in Large Language Models via Error Localization and Test-Time Self-Correction".
This repository contains the implementation of ReFlective Policy Optimization (RFPO), a novel reinforcement learning framework designed to enhance both the problem-solving and self-reflection capabilities of large language models (LLMs). RFPO introduces structured self-critique and targeted regeneration into the training loop, enabling LLMs to identify and correct their own reasoning flaws, especially in complex mathematical and logical tasks.
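At a high level, each training step lets the model attempt a problem, critique its own attempt, and try again before any policy update. The sketch below is a minimal illustration of that loop as we read the description above; every name in it is hypothetical and is not the repository's actual API:

```python
# Hypothetical sketch of one RFPO-style update: solve, self-critique,
# regenerate, then reinforce. None of these names come from this codebase.

def rfpo_step(policy, prompt, reward_fn):
    # 1) Sample an initial solution from the current policy.
    draft = policy.generate(prompt)

    # 2) Structured self-critique: the model localizes flawed reasoning steps.
    critique = policy.generate(
        f"{prompt}\n\nProposed solution:\n{draft}\n\nIdentify any incorrect steps:"
    )

    # 3) Targeted regeneration: re-solve conditioned on the critique.
    revised = policy.generate(
        f"{prompt}\n\nCritique of a previous attempt:\n{critique}\n\nRevised solution:"
    )

    # 4) Score both attempts and apply a policy-gradient update (e.g., GRPO-style).
    policy.update(
        prompt=prompt,
        responses=[draft, revised],
        rewards=[reward_fn(draft), reward_fn(revised)],
    )
```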
To install the required dependencies for training, run:
cd RFPO
pip install -r requirements.txt
To train a model using RFPO, follow the steps below:
- Modify the training script at:
/RFPO/examples/grpo_trainer/run_qwen2.5-7b.sh
Update the following parameters according to your environment (an illustrative example follows these steps):
  - model.path: Path to the base model (e.g., Qwen2.5-7B-Instruct)
  - train_batch_size: Training batch size (e.g., 16)
  - num_gpus: Number of GPUs used (e.g., 8)
  - default_local_dir: Output directory to store checkpoints
- Launch training:
bash /RFPO/examples/grpo_trainer/run_qwen2.5-7b.sh
Model checkpoints will be saved to the specified default_local_dir.
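For reference, the parameter block inside run_qwen2.5-7b.sh will look roughly like the sketch below. The repository layout suggests a verl-style Hydra launcher, but the entry point and key prefixes here are assumptions, not taken from the actual script; match whatever names the script itself uses.

```bash
# Illustrative overrides only -- entry point and key prefixes are assumed;
# edit the corresponding values in the actual script.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.path=/path/to/Qwen2.5-7B-Instruct \
    data.train_batch_size=16 \
    trainer.n_gpus_per_node=8 \
    trainer.default_local_dir=/path/to/checkpoints
```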
To evaluate the trained RFPO model, we recommend setting up a separate evaluation environment to avoid conflicts with training dependencies:
cd /RFPO/tests
pip install -r requirements.txt
Then merge the trained actor checkpoint into a HuggingFace-format model:
cd /RFPO/scripts
python model_merger.py --local_dir default_local_dir/global_step_num/actor
Replace default_local_dir/global_step_num with the actual path of your checkpoint.
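After merging, it can be worth sanity-checking that the result loads as a standard HuggingFace model before running the full evaluation. The path below is a placeholder for wherever model_merger.py wrote the merged weights:

```python
# Quick smoke test of the merged checkpoint (path is a placeholder).
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "/path/to/merged_model"  # hypothetical merged output location
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype="auto", device_map="auto")

prompt = "Solve step by step: what is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```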
Edit the file:
/RFPO/tests/tools/scripts/evaluate.sh
Set the following parameters:
  - model_path: Path to the merged HuggingFace-format model
  - output_dir: Directory where evaluation results will be saved
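For example (illustrative values only; whether these are set as shell variables or passed as flags depends on the script, so mirror its existing style):

```bash
# Inside evaluate.sh -- illustrative values only
model_path=/path/to/merged_model
output_dir=/path/to/eval_results
```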
cd /RFPO/tests/tools/scripts
bash evaluate.sh
The test results will be saved to the specified output_dir.
For more details on the RFPO framework, including algorithmic insights and benchmark results, please refer to our paper.
