ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization

ReForm is a reflective autoformalization framework that enables LLMs to iteratively generate, validate, and self-correct formal mathematical statements in Lean 4 through an integrated generation-validation loop.

  • Reflective Autoformalization Paradigm: Introduces an iterative "generate → validate → refine" cycle that enables models to autonomously identify and correct semantic errors, unifying generation and verification in a single process (see the sketch after this list).

  • Prospective Bounded Sequence Optimization (PBSO): A novel RL algorithm designed for heterogeneous rewards at different sequence positions, enabling stable training of models with both accurate autoformalization and reliable semantic validation.

  • ConsistencyCheck Benchmark: 859 expert-annotated items for evaluating semantic consistency, revealing that even human experts produce errors in up to 38.5% of cases.
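
To make the paradigm concrete, here is a minimal, hypothetical sketch of the generate → validate → refine cycle in Python. The generate and validate callables stand in for the model call and the Lean-compilation / semantic-consistency checks; they are illustrative and not part of the ReForm codebase.

from typing import Callable, Optional, Tuple

def reflective_autoformalize(
    informal: str,
    generate: Callable[[str, str, Optional[str]], str],         # (informal, prev_formal, feedback) -> formal
    validate: Callable[[str, str], Tuple[bool, Optional[str]]],  # (informal, formal) -> (ok, feedback)
    max_rounds: int = 3,
) -> str:
    """Iterate generate -> validate -> refine until the formal statement is accepted."""
    formal, feedback = "", None
    for _ in range(max_rounds):
        # Generation/refinement: propose a Lean 4 statement, conditioned on prior feedback.
        formal = generate(informal, formal, feedback)
        # Validation: e.g. a compilation check plus a semantic consistency check.
        ok, feedback = validate(informal, formal)
        if ok:
            break
    return formal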

Figure 1. Performance comparison of ReForm against state-of-the-art models.


Figure 2. RL dynamics in our PBSO process.

💥 News

  • [2026-01-31] 🎉 ReForm has been accepted at ICLR 2026, and we have released our RL implementation. Feel free to reach out with any questions.
  • [2025-10-31] 🎉 We release the ReForm-32B model on Hugging Face, which is more powerful than ReForm-8B.
  • [2025-10-29] 🎉 We release the ReForm paper, models, and ConsistencyCheck benchmark!

🎯 Quick Start

Step 1: Download Our Models from Hugging Face

Please download the following models from Hugging Face:

ReForm-8B
ReForm-32B

Step 2: View Our Preprocessed Test Set

We provide our preprocessed test set (miniF2F, ProofNet, Putnam, and AIME 2025) for your convenience.

./data
└── test
    ├── aime2025.jsonl
    ├── minif2f_testset.jsonl
    ├── proofnet_testset.jsonl
    └── putnam_v4.jsonl
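
Each file is standard JSON Lines, so you can inspect it before running inference. The exact field names depend on the release; the snippet below simply prints the keys of the first record of one file.

import json

# Peek at the first record of a preprocessed test file; field names depend on the release.
with open("./data/test/minif2f_testset.jsonl", encoding="utf-8") as f:
    first_record = json.loads(next(f))

print(sorted(first_record.keys()))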

Step 3: Run Our Inference Script

python ./script/reform_decode.py \
    --model-path <PATH-to-ReForm> \
    --data-path <PATH-to-test>

RL Recipe

We use slime as our RL training framework, with custom modifications for PBSO (Prospective Bounded Sequence Optimization). Thanks to the slime team for their wonderful RL framework ❤️.

Setup

1. Clone with Submodule

git clone --recursive https://github.com/YOUR_USERNAME/ReForm.git
cd ReForm

Or if you've already cloned without --recursive:

git submodule update --init

2. Apply Custom Patch

Our PBSO algorithm requires custom modifications to slime. Apply the patch:

cd slime
git apply ../patch/slime.patch

3. Install Dependencies

cd slime
pip install -e .

What's in the Patch?

The patch (patch/slime.patch) contains modifications for multi-step reward handling required by PBSO:

  • slime/utils/arguments.py: new arguments --custom-adv-returns-function-path, --reward-shaping, and --reward-shaping-gamma
  • slime/backends/megatron_utils/loss.py: support for custom advantage/returns computation functions
  • slime/backends/megatron_utils/data.py: multi-step reward logging (task_rewards for the final step plus average rewards)
  • slime/backends/utils/data.py: added raw_reward field to rollout data
  • slime/rollout/filter_hub/dynamic_sampling_filters.py: new filters check_all_reward_nonzero_std and check_task_reward_nonzero_std

Submodule Version

The slime submodule is pinned to commit c64c2606 (#278), which is compatible with our PBSO implementation.

Docker Version

You can use the provided Docker environment (docker pull slimerl/slime:20250903-v1) for PBSO.

Training Pipeline

Step 1: Convert Checkpoint

Convert your HuggingFace checkpoint to Megatron-Core format:

bash hf2mcore/hf2mcore.sh

Step 2: Launch CriticLean Server

Start the CriticLean server for semantic consistency checking:

bash launch_llm/launch_sglang.sh
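
Once the server is running, you can sanity-check it with a plain HTTP request. The snippet below assumes SGLang's OpenAI-compatible chat endpoint on the default port 30000; the actual host, port, and model name depend on how launch_llm/launch_sglang.sh configures the server.

import requests

# Hypothetical smoke test: adjust the URL and model name to match launch_sglang.sh.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",  # assumed OpenAI-compatible endpoint
    json={
        "model": "critic-lean",  # placeholder model name
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    },
    timeout=30,
)
print(resp.status_code, resp.json())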

Step 3: Run SFT (Optional)

If starting from a base model, run supervised fine-tuning first:

bash sft_scripts/qwen3-8B.sh

Step 4: Run PBSO Training

Run the PBSO reinforcement learning training:

bash rl_scripts/qwen3-8B-PBSO.sh

Directory Structure

ReForm/
├── lean_plugins/           # Core PBSO implementation
│   ├── reward.py           # Multi-step reward with discounted propagation
│   ├── generate.py         # Rollout generation for autoformalization
│   ├── adv_utils.py        # Advantage/returns computation
│   ├── verify_utils.py     # Lean 4 verification utilities
│   ├── format_check.py     # Output format validation
│   └── critic_lean_prompt.py  # Prompts for semantic consistency checking
├── rl_scripts/             # RL training scripts
│   └── qwen3-8B-PBSO.sh    # PBSO training script
├── sft_scripts/            # SFT training scripts
│   └── qwen3-8B.sh         # Supervised fine-tuning script
├── hf2mcore/               # Checkpoint conversion
│   └── hf2mcore.sh         # HuggingFace to Megatron-Core conversion
├── launch_llm/             # LLM server launch scripts
│   └── launch_sglang.sh    # SGLang server for CriticLean
├── patch/                  # Slime modifications
│   └── slime.patch         # PBSO patch for slime
├── slime/                  # Slime submodule (RL framework)
├── data/                   # Test datasets
└── script/                 # Inference scripts

Key PBSO Arguments

  • --reward-shaping: reward shaping strategy (default: discounted_with_clip)
  • --reward-shaping-gamma: discount factor for reward propagation (default: 0.5)
  • --custom-adv-returns-function-path: custom advantage/returns function (default: lean_plugins.adv_utils.compute_step_advantages_and_returns)
  • --custom-rm-path: custom reward function (default: lean_plugins.reward.reward_func)
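
For intuition, here is a minimal sketch of what a discounted-with-clip shaping rule could look like: rewards observed at later steps are propagated backward with discount factor gamma, and the shaped value at each step is clipped to a bounded range. This illustrates the idea behind --reward-shaping discounted_with_clip only; it is not the code in lean_plugins/reward.py.

from typing import List

def shape_rewards(step_rewards: List[float], gamma: float = 0.5,
                  clip_min: float = -1.0, clip_max: float = 1.0) -> List[float]:
    """Propagate later rewards backward with discount gamma, clipping each shaped value."""
    shaped, running = [0.0] * len(step_rewards), 0.0
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        shaped[t] = max(clip_min, min(clip_max, running))
    return shaped

# Example: three refinement rounds where only the final statement passes verification.
print(shape_rewards([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]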

🚀 ConsistencyCheck Benchmark

ConsistencyCheck is a carefully curated dataset designed to assess how well formal mathematical statements capture the semantic intent of their natural language counterparts. This benchmark addresses the critical challenge of semantic fidelity in mathematical formalization and serves as a key evaluation component for the ReForm methodology.

✨✨ Primary Purpose: To evaluate and advance research in automated mathematical formalization, particularly focusing on semantic consistency between natural language mathematics and formal theorem proving systems.

🏗️ Data Construction

1. Data Sources

The benchmark is constructed from two established mathematical formalization datasets:

  • miniF2F: Zheng, K., Han, J. M., & Polu, S. (2021). Minif2f: a cross-system benchmark for formal olympiad-level mathematics. arXiv preprint arXiv:2109.00110.
  • ProofNet: Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E. W., Radev, D., & Avigad, J. (2023). Proofnet: Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint arXiv:2302.12433.

2. Annotation Protocol

  • Two independent expert annotators compare each formal statement with its natural-language problem.
  • Disagreements are resolved by a third senior expert.
  • Each item includes human judgment (human_check) and a textual explanation (human_reason).
  • All Lean statements compile successfully to isolate semantic issues.

3. Data Format

Each example follows this JSON structure:

{
  "name": "problem_identifier",
  "split": "valid|test",
  "goal": "Lean4 goal statement",
  "header": "Lean4 imports and opening commands",
  "informal_statement": "Natural language problem statement",
  "formal_statement": "Formalized theorem statement",
  "human_check": "true|false",
  "human_reason": "Explanation for incorrect labels"
}

⚠️ Known Issues

During annotation, we identified several problematic informal statements:

1. miniF2F Issues:

  • amc12a_2011_p18: Missing specification of whether x equals zero
  • amc12_2000_p11: Contains only answer choices without actual problem statement

2. ProofNet Issues:

  • exercise_1998_a3: Incomplete condition after "such that"
  • exercise_1_18b: Missing specification of whether x equals zero

🚀 Usage

1. Loading the Dataset

from datasets import load_dataset

dataset = load_dataset("GuoxinChen/ConsistencyCheck")

example = dataset["test"][0]
print(example["informal_statement"])
print(example["formal_statement"])
print(example["human_check"])

2. Evaluation

To evaluate a model on this benchmark:

  1. Generate formal statements for the natural language problems
  2. Compare against the human_check ground truth (a scoring sketch is shown below)
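
A minimal way to score step 2 is to treat it as binary classification against the human_check labels. The sketch below assumes a hypothetical my_consistency_check(informal, formal) predicate that returns True when the formal statement is judged faithful to the informal one.

from datasets import load_dataset

def evaluate(my_consistency_check) -> float:
    """Accuracy of a consistency judge against the human_check labels."""
    data = load_dataset("GuoxinChen/ConsistencyCheck")["test"]
    correct = 0
    for ex in data:
        gold = str(ex["human_check"]).lower() == "true"
        pred = my_consistency_check(ex["informal_statement"], ex["formal_statement"])
        correct += int(pred == gold)
    return correct / len(data)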

🌟 Community Contributions

We hope this benchmark will contribute to the broader mathematical formalization community by:

  1. Standardized Evaluation: Providing a reliable benchmark for comparing autoformalization systems
  2. Semantic Focus: Emphasizing semantic consistency over syntactic correctness
  3. Quality Assurance: Highlighting common pitfalls in mathematical formalization
  4. Research Advancement: Supporting development of more robust formalization methods


📚 Citation

If you find ReForm useful in your research, please cite our paper and star 🌟 our repo:

@misc{chen2025reform,
      title={ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization}, 
      author={Guoxin Chen and Jing Wu and Xinjie Chen and Wayne Xin Zhao and Ruihua Song and Chengxi Li and Kai Fan and Dayiheng Liu and Minpeng Liao},
      year={2025},
      eprint={2510.24592},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.24592}, 
}

☀️ Acknowledgments

We gratefully acknowledge:

  • Dataset Foundation: miniF2F and ProofNet for their pioneering formalization datasets
  • Formal Mathematics: The Lean community (Lean, Mathlib) for their world-class theorem proving infrastructure
  • Training Framework: Slime for the powerful RL framework enabling our PBSO algorithm
  • Inference Optimization: vLLM and SGLang for blazing-fast inference capabilities

Special thanks to all contributors and the broader open-source community.


If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.
