[🌐 Website] • [📜 Paper] • [🐱 GitHub]
Repo for "Semantic-Space Exploration and Exploitation in RLVR for LLM Reasoning"
Figure 1: Comparative analysis of responses from DeepSeek-R1-Distill-Qwen-7B on the simpleRL-reason test dataset (Levels 3 to 5). (a) Traditional metrics for exploitation and exploration are constrained by negative coupling, leading to meandering progress in both capabilities. (b) Our metrics are mutually independent. (c) Training regularized with our metrics achieves stronger performance in both exploitation (small K) and exploration (large K).
- [2026/04/06] 🎉 Our work is accepted as an ACL 2026 Findings paper.
- [2025/10/10] 🚀 We provide the full code for training and evaluation for VERL.
- [2025/09/28] 📄 Paper, repository, and website released.
For a brief description, please refer to our Project Page; for a detailed description, please refer to the Paper.
VERL extends veRL with specific components across the following modules:
verl/trainer/main_ppo.py & verl/trainer/reward_manager_versions.py
- Main entry point with Ray initialization
- RewardManager for reward distribution
verl/trainer/metrics_calculator.py & verl/trainer/metrics_utils.py
- RepresentationMetricsCalculator for metrics calculation
- Hidden-state metrics in metrics_utils.py
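The exact metric formulas live in metrics_utils.py; as a rough illustration of the kind of hidden-state statistic such a calculator can compute, the sketch below measures semantic diversity as the mean pairwise cosine distance over per-response hidden-state vectors. The function names and the specific metric are hypothetical stand-ins, not the repo's actual API.

```python
import math

def cosine_distance(u, v):
    """1 minus cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_pairwise_distance(hidden_states):
    """Average cosine distance over all pairs of hidden-state vectors.

    Higher values suggest the responses are more spread out in
    semantic space. Illustrative only -- the real metrics are
    defined in metrics_utils.py.
    """
    n = len(hidden_states)
    if n < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine_distance(hidden_states[i], hidden_states[j])
            pairs += 1
    return total / pairs
```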
verl/trainer/ppo/ray_trainer.py
- Main RL training loop: data loading, LLM rollout, model updates, evaluation, checkpointing
- RL algorithm-specific advantage computation
- Source of core functions called in ray_trainer.py: LLM model/optimizer initialization, generate_sequences, update_actor
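The stages of the training loop described above (data loading, rollout, advantage computation, model update, checkpointing) can be sketched as follows. Every name here is a hypothetical stand-in for illustration, not the actual verl interface.

```python
# Illustrative skeleton of the RL loop stages listed above.
# All class and method names are hypothetical, not the real verl API.

def rl_training_loop(dataloader, actor, checkpointer, num_steps):
    """One pass over the stages: rollout -> advantage -> update -> checkpoint."""
    losses = []
    for step, batch in zip(range(num_steps), dataloader):
        responses = actor.generate_sequences(batch)       # LLM rollout
        advantages = actor.compute_advantages(responses)  # algorithm-specific
        loss = actor.update_actor(responses, advantages)  # model update
        losses.append(loss)
        if step % 10 == 0:
            checkpointer.save_checkpoint(step)            # periodic checkpointing
    return losses
```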
VERL extends vLLM with specific components in the following folder:
- Added the hidden states extraction feature
- Modified from the low-level LLM model classes all the way up to the worker
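Since vLLM does not return intermediate activations out of the box, the modification threads hidden states from the model classes up to the worker. The toy sketch below illustrates the general pattern of recording each layer's output as it passes through the stack; the classes here are simplified stand-ins, not the repo's actual vLLM patch.

```python
# Toy illustration of the extraction pattern: record each layer's
# output before passing it on. Both classes are hypothetical
# stand-ins for the patched vLLM model classes.

class Layer:
    """Stand-in transformer layer that just scales its input."""
    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return [v * self.scale for v in x]

class HiddenStateRecorder:
    """Runs a stack of layers and stores every intermediate output."""
    def __init__(self, layers):
        self.layers = layers
        self.hidden_states = []

    def forward(self, x):
        self.hidden_states = [x]          # embedding-level input
        for layer in self.layers:
            x = layer.forward(x)
            self.hidden_states.append(x)  # per-layer hidden state
        return x
```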
Our code is implemented based on simpleRL-reason. We recommend using Conda to manage your environment. We use vLLM (0.5.4) to accelerate inference. Run the following commands to set up your environment:
conda create -n verl python==3.10.16
conda activate verl
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install -r requirements.txt

We also open-source our complete training scripts for the community. We follow the training data used in simpleRL-reason.
The training process leverages Ray and vLLM for acceleration, so first you need to launch the Ray cluster using the commands below:
# launch the master node of ray
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
# if you want to launch ray on more nodes, use
ray start --address {MASTER-NODE-ADDRESS}:6379 --num-gpus 8

To start training, configure the required environment variables and customize the experiment settings at the end of the train.sh script. Then, from the master node, submit the training job by running the following command:
bash train.sh

For the details of the experiment settings, you can refer to here.
We provide a script for inference; simply configure RUN_NAME_MAP and ACTIVE_CONFIG_SET in eval.sh and run the following command:
bash eval.sh

You can also add your own test datasets to this folder.
If you find this repository helpful, please consider citing our paper:
@misc{huang2026semanticspaceexplorationexploitationrlvr,
title={Semantic-Space Exploration and Exploitation in RLVR for LLM Reasoning},
author={Fanding Huang and Guanbo Huang and Xiao Fan and Yi He and Xiao Liang and Xiao Chen and Qinting Jiang and Faisal Nadeem Khan and Jingyan Jiang and Zhi Wang},
year={2026},
eprint={2509.23808},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2509.23808},
}
We sincerely appreciate the outstanding work of veRL and SimpleRL-Zoo.