[🌐 Website] • [📜 Paper] • [🐱 GitHub]
Repo for "Semantic-Space Exploration and Exploitation in RLVR for LLM Reasoning"
Figure 1: Comparative analysis of responses from DeepSeek-R1-Distill-Qwen-7B on the simpleRL-reason test dataset (Levels 3 to 5). (a) Traditional metrics for exploitation and exploration are constrained by negative coupling, leading to meandering progress in both capabilities. (b) Our metrics are mutually independent. (c) Training regularized with our metrics achieves stronger performance in both exploitation (small K) and exploration (large K).
- [2026/04/06] 🎉 Our work is accepted as an ACL 2026 Findings paper.
- [2025/10/10] 🚀 We provide the full code for training and evaluation for VERL.
- [2025/09/28] 📄 Paper, repository, and website released.
For a brief description, please refer to our Project Page; for a detailed description, please refer to the Paper.
VERL extends veRL with specific components across the following modules:
verl/trainer/main_ppo.py & verl/trainer/reward_manager_versions.py
- Main entry point with Ray initialization
- RewardManager for reward distribution
verl/trainer/metrics_calculator.py & verl/trainer/metrics_utils.py
- RepresentationMetricsCalculator for metrics calculation
- Hidden-state metrics in metrics_utils.py
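The exact metric formulas live in metrics_utils.py; as a rough illustration of the kind of hidden-state statistic such a calculator can compute, the sketch below measures semantic diversity as the mean pairwise cosine distance over per-response hidden-state vectors. The function names and the specific metric are hypothetical stand-ins, not the repo's actual API.

```python
import math

def cosine_distance(u, v):
    """1 minus cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_pairwise_distance(hidden_states):
    """Average cosine distance over all pairs of hidden-state vectors.

    Higher values suggest the responses are more spread out in
    semantic space. Illustrative only -- the real metrics are
    defined in metrics_utils.py.
    """
    n = len(hidden_states)
    if n < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine_distance(hidden_states[i], hidden_states[j])
            pairs += 1
    return total / pairs
```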
verl/trainer/ppo/ray_trainer.py
- Main RL training loop: data loading, LLM rollout, model updates, evaluation, checkpointing
- RL algorithm-specific advantage computation
- Source of core functions called in ray_trainer.py: LLM model/optimizer initialization, generate_sequences, update_actor
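The stages of the training loop described above (data loading, rollout, advantage computation, model update, checkpointing) can be sketched as follows. Every name here is a hypothetical stand-in for illustration, not the actual verl interface.

```python
# Illustrative skeleton of the RL loop stages listed above.
# All class and method names are hypothetical, not the real verl API.

def rl_training_loop(dataloader, actor, checkpointer, num_steps):
    """One pass over the stages: rollout -> advantage -> update -> checkpoint."""
    losses = []
    for step, batch in zip(range(num_steps), dataloader):
        responses = actor.generate_sequences(batch)       # LLM rollout
        advantages = actor.compute_advantages(responses)  # algorithm-specific
        loss = actor.update_actor(responses, advantages)  # model update
        losses.append(loss)
        if step % 10 == 0:
            checkpointer.save_checkpoint(step)            # periodic checkpointing
    return losses
```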
VERL extends vLLM with specific components in the following folder:
- Added the hidden states extraction feature
- Modified from the low-level LLM model classes all the way up to the worker
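Since vLLM does not return intermediate activations out of the box, the modification threads hidden states from the model classes up to the worker. The toy sketch below illustrates the general pattern of recording each layer's output as it passes through the stack; the classes here are simplified stand-ins, not the repo's actual vLLM patch.

```python
# Toy illustration of the extraction pattern: record each layer's
# output before passing it on. Both classes are hypothetical
# stand-ins for the patched vLLM model classes.

class Layer:
    """Stand-in transformer layer that just scales its input."""
    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return [v * self.scale for v in x]

class HiddenStateRecorder:
    """Runs a stack of layers and stores every intermediate output."""
    def __init__(self, layers):
        self.layers = layers
        self.hidden_states = []

    def forward(self, x):
        self.hidden_states = [x]          # embedding-level input
        for layer in self.layers:
            x = layer.forward(x)
            self.hidden_states.append(x)  # per-layer hidden state
        return x
```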
Our code is implemented based on simpleRL-reason. We recommend using Conda to manage your environment. We use vLLM (0.5.4) to accelerate inference. Run the following commands to set up your environment:
conda create -n verl python==3.10.16
conda activate verl
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install -r requirements.txt

We also open-source our complete training scripts for the community. We follow the training data used in simpleRL-reason.
The training process leverages Ray and vLLM for acceleration, so first you need to launch the Ray cluster using the commands below:
# launch the master node of ray
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
# if you want to launch ray on more nodes, use
ray start --address {MASTER-NODE-ADDRESS}:6379 --num-gpus 8

To start training, configure the required environment variables and customize the experiment settings at the end of the train.sh script. Then, from the master node, submit the training job by running the following command:
bash train.sh

For the details of the experiment settings, you can refer to here.
We provide a script for inference; simply configure RUN_NAME_MAP and ACTIVE_CONFIG_SET in eval.sh and run the following command:
bash eval.sh

You can also add your own test datasets to this folder.
If you find this repository helpful, please consider citing our paper:
@misc{huang2026semanticspaceexplorationexploitationrlvr,
title={Semantic-Space Exploration and Exploitation in RLVR for LLM Reasoning},
author={Fanding Huang and Guanbo Huang and Xiao Fan and Yi He and Xiao Liang and Xiao Chen and Qinting Jiang and Faisal Nadeem Khan and Jingyan Jiang and Zhi Wang},
year={2026},
eprint={2509.23808},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2509.23808},
}
We sincerely appreciate the outstanding work of veRL and SimpleRL-Zoo.