
QWE-CXZ/LESS


Beyond High-Entropy Exploration: Correctness-Aware Low-Entropy Segment-Based Advantage Shaping for Reasoning LLMs

🔍 Project Overview

This project proposes LESS (Low-Entropy Segment Shaping), a correctness-aware reinforcement learning framework for reasoning LLMs. Unlike existing entropy-based methods that focus only on high-entropy exploration tokens, LESS explicitly leverages low-entropy segments—the stable structural components that account for ~80% of reasoning trajectories—when optimizing policy updates.

By distinguishing low-entropy segments into "correct-only", "incorrect-only", and "shared" types, LESS amplifies productive reasoning patterns, suppresses repeated errors, and preserves high-entropy exploration. Instantiated on top of GRPO, it consistently improves accuracy, stability, and robustness across mathematical reasoning tasks.
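The README does not spell out the shaping rule itself. As a rough illustration of the idea above, the following sketch types each low-entropy segment by whether its text occurs only in correct rollouts, only in incorrect rollouts, or in both, then scales token advantages accordingly. The substring-matching criterion and the `boost`/`damp` factors are hypothetical choices for illustration, not the paper's actual implementation.

```python
def classify_segment(seg_text, correct_texts, incorrect_texts):
    """Label a segment by which rollouts it appears in (illustrative criterion:
    plain substring matching against the group's correct/incorrect responses)."""
    in_correct = any(seg_text in t for t in correct_texts)
    in_incorrect = any(seg_text in t for t in incorrect_texts)
    if in_correct and not in_incorrect:
        return "correct_only"
    if in_incorrect and not in_correct:
        return "incorrect_only"
    return "shared"

def shape_advantages(advantages, segments, labels, boost=1.2, damp=0.8):
    """Amplify correct-only segments, suppress incorrect-only ones, and leave
    shared segments and high-entropy tokens untouched."""
    shaped = list(advantages)
    for (start, end), label in zip(segments, labels):
        scale = {"correct_only": boost, "incorrect_only": damp}.get(label, 1.0)
        for i in range(start, end):
            shaped[i] *= scale
    return shaped
```

High-entropy tokens fall outside every segment span, so their advantages pass through unchanged—consistent with preserving exploration as described above.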

🚀 Core Features

  • Low-Entropy Segment Awareness: Identifies and modulates stable reasoning structures based on their correlation with correctness.

  • Plug-and-Play Design: Seamlessly integrates with existing RLVR algorithms (e.g., GRPO) as an advantage-shaping module.

  • Performance Boost: Outperforms strong baselines (GRPO, Forking Tokens, KL-Cov) on 6+ math benchmarks.

  • Robustness Enhancement: Reduces worst-case performance variance and raises the floor of model reliability.

  • Broad Compatibility: Works with diverse model scales (1.5B–7B) and both math-specialized/base LLMs.
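To make the plug-and-play claim concrete, here is a minimal sketch of how an advantage-shaping module could hook into GRPO's group-relative advantage computation. The function names and the optional `token_scales` hook are assumptions for illustration; only the group-normalization formula itself follows standard GRPO.

```python
def grpo_advantages(group_rewards):
    """Standard GRPO: normalize each rollout's reward against its group."""
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in group_rewards]

def token_advantages(rollout_advantage, num_tokens, token_scales=None):
    """Broadcast the rollout-level advantage to tokens; `token_scales` is the
    hypothetical plug-in point where a segment-shaping module (like LESS)
    would supply per-token multipliers."""
    scales = token_scales if token_scales is not None else [1.0] * num_tokens
    return [rollout_advantage * s for s in scales]
```

Because the shaping module only rescales already-computed advantages, swapping it in requires no change to the underlying RLVR algorithm's loss or sampling loop.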

📊 Key Results

  • Accuracy Improvement: Average accuracy gains of 2–4 points across Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Qwen2.5-Base-7B.

  • Challenging Tasks: Notable gains on AIME24/25 (math olympiad-level tasks) and AMC23.

  • Worst-Case Robustness: Improves worst@32 score by +6.1 (1.5B) and +7.8 (7B) points compared to vanilla GRPO.

  • Stability: Reduces response variance (std@32) across sampled rollouts for more predictable reasoning.

🛠️ Getting Started

Prerequisites

  • Dependencies: Follow the verl installation guide (supports PyTorch, FSDP, Megatron-LM).

  • Inference Engine: vLLM ≥ 0.8.2 (for high-throughput rollout generation).

  • Model: Qwen2.5 family (1.5B/7B Math, 7B Base; other LLMs are compatible with minor modification).

  • Dataset: hendrycks_math (7.5k math problems covering algebra, geometry, number theory, etc.).

Training Configuration

Recommended hyperparameter: Minimum low-entropy segment length μ=5 (optimal for stability and accuracy).
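To illustrate what the minimum segment length μ does, the sketch below extracts maximal runs of consecutive low-entropy tokens and discards any run shorter than μ. The entropy threshold of 0.5 and the function name are hypothetical; the README specifies only μ=5.

```python
def low_entropy_segments(entropies, threshold=0.5, min_len=5):
    """Return (start, end) spans where token entropy stays below `threshold`
    for at least `min_len` consecutive tokens (min_len corresponds to μ)."""
    segments, start = [], None
    # Append an infinite sentinel so a trailing low-entropy run is closed.
    for i, h in enumerate(list(entropies) + [float("inf")]):
        if h < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                segments.append((start, i))
            start = None
    return segments
```

With μ=5, brief dips into low entropy are ignored, so only genuinely stable structural stretches are treated as segments—shorter μ admits noisy runs, while larger μ discards real structure.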

Quick Run

For training Qwen2.5-7B on a single node (8 NVIDIA A100-40G GPUs):

```shell
cd verl
conda activate your_env
bash 7b_base.sh
```

📋 Acknowledgement

  • We build the RL framework on top of verl, a flexible and efficient RLHF library.

  • Inference is accelerated by vLLM with PagedAttention for high throughput.

  • Models are trained on the Qwen2.5 family, optimized for mathematical reasoning.

  • Training data is derived from hendrycks_math, a benchmark for measuring mathematical problem-solving ability.
