Quality-Gated Reward Escalation — a single-GPU reinforcement learning engine for LLMs.
No TRL. No verl. No Ray. One process, one GPU, direct function calls. ~16,600 lines of Python.
Supervised fine-tuning matches outputs. The model learns to produce text that looks like the answer. It does not learn to produce text that is the answer. The distinction matters when the domain has verifiable correctness — when you can check whether a Hamiltonian satisfies Hamilton's equations, when you can differentiate symbolically to confirm self-consistency, when the ground truth is a mathematical object and not a string.
SFT teaches format. RL teaches grounding. QGRE is the RL engine.
A 1.7B parameter model (Qwen3-1.7B, 4-bit quantized) trained with QGRE on a single RTX 5080 (16 GB) derives Hamiltonian mechanics from first principles. The reward function uses math-verify and sympy to check symbolic equivalence — not string matching, not format scoring. The model writes kinetic energy, potential energy, composes the Hamiltonian, derives Hamilton's equations of motion, and maintains self-consistency. Verified by symbolic differentiation.
Step 3000+ (April 2026):
| Quality | Accuracy | Notes |
|---|---|---|
| V (potential energy) | ~95% | |
| T (kinetic energy) | ~72% | Remaining failures are correct generic formulas the reward function now recognizes |
| H (Hamiltonian) | ~85% | |
| Hamilton's equations | ~62-65% | dq/dt and dp/dt combined |
| Self-consistency | ~52% | Model's equations checked against its own H |
| Perfect score (6/6 = 1.0) | ~30% of completions |
Typical reward per step: 0.667-0.833. VRAM: 11 GB steady, zero growth over 3000+ steps. LoRA rank 32, 4-bit quantization, cosine LR schedule.
The model started from zero. No SFT warmup. No Hamiltonian mechanics in pretraining. Phase-gated curriculum with a skill tree — freefall first, then spring, then compound systems, then boss problems (driven oscillator). The model learned what a Hamiltonian is through reinforcement alone.
Here is what the model produces (step 3026, perfect score):
COORDINATES: q = y
MOMENTUM: p = m dy/dt = 4.0 dy/dt
KINETIC: T = p²/8
POTENTIAL: V = 39.2 y
HAMILTONIAN: H = p²/8 + 39.2 y
EQUATIONS:
dq/dt = ∂H/∂p = p/4
dp/dt = -∂H/∂q = -39.2
All six qualities verified: T matches p²/(2m) with m=4, V matches mgy, H = T + V, dq/dt = ∂H/∂p, dp/dt = -∂H/∂q, and the model's own equations are consistent with its own Hamiltonian. Checked by sympy, not regex.
Existing RL engines assume a specific deployment topology. verl requires 4-8 A100s and Ray. TRL requires a group of 8 completions per prompt. OpenRLHF requires a separate critic model. All three apply a single scalar reward uniformly across every token in the completion.
QGRE assumes one GPU and asks: what if credit assignment were the product, not an afterthought?
| QGRE | verl | TRL | OpenRLHF | |
|---|---|---|---|---|
| Gradient target | Per-token, per-quality, per-step | Per-completion scalar | Per-completion scalar | Per-completion + critic |
| Cold start | Phase gating + skill tree from zero | Requires SFT | Requires SFT | Requires SFT |
| Credit assignment | Segmented: each token gets the gradient for its quality | Uniform scalar | Uniform scalar | Learned critic |
| Completions per prompt | 1 (SPO) | 8-16 (GRPO) | 8 (GRPO) | 1 (PPO + critic) |
| GPU requirement | 1 x 16 GB consumer | 4-8 x A100 80 GB | 1-4 GPUs | 4-8 GPUs |
| Lines of code | ~16,600 | ~50,000 | ~30,000 | ~40,000 |
The existing engines were not built wrong. They were built for a different problem — RLHF with scalar preference signals. QGRE addresses the problem that appears when rewards are structured, verifiable, and decomposable into per-region per-quality signals.
- Python >= 3.10
- PyTorch >= 2.4.0
- CUDA GPU with >= 16 GB VRAM (tested on RTX 5080, RTX 4090)
- Unsloth >= 2026.3.5 (provides vLLM integration and 4-bit quantization)
git clone https://github.com/torad-labs/qgre-engine.git
cd qgre-engine
pip install -e ".[unsloth,hamiltonian,dev]"python -m qgre train \
--config examples/hamiltonian/config.yaml \
--reward examples.hamiltonian.reward_fn_v2:hamiltonian_rewardThis trains Qwen3-1.7B (4-bit) to derive Hamiltonian mechanics. The reward function uses math-verify and sympy for symbolic equivalence checking. Training logs to MLflow.
model:
path: unsloth/Qwen3-1.7B-unsloth-bnb-4bit
lora_rank: 32
lora_alpha: 64
load_in_4bit: true
fast_inference: true
gpu_memory_utilization: 0.25
weight_sync_strategy: merge
pad_token: "<|fim_pad|>"
pad_token_id: 151662
lora_target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
modules_to_save: [lm_head]
data:
train_files:
- path/to/train.parquet
max_prompt_length: 1024
train_batch_size: 4
prompt_column: prompt
metadata_columns: [ground_truth]
generation:
temperature: 0.7
max_tokens: 4096
stop_token_ids: [151643, 151645]
algorithm:
mode: spo
loss_type: dr_grpo
use_fused_logprobs: true
use_triton_logprobs: true
step_qualities:
1: [q_correctness]
training:
total_steps: 5000
lr: 5.0e-6
lr_scheduler: cosine
save_freq: 50from qgre.types import RewardResult
def my_reward(completion: str, metadata: dict) -> RewardResult:
"""Score a completion against ground truth.
Returns RewardResult with per-quality scores in [0, 1].
The scores dict keys must match the quality names in step_qualities.
"""
correct = check_answer(completion, metadata["ground_truth"])
return RewardResult(
reward=1.0 if correct else 0.0,
scores={"q_correctness": 1.0 if correct else 0.0},
)For fine-grained credit assignment, return scored_spans — character offsets where each quality was found:
return RewardResult(
reward=overall_score,
scores={"q_step1": 0.8, "q_step2": 1.0},
scored_spans={
"q_step1": {"start": 45, "end": 120, "score": 0.8},
"q_step2": {"start": 125, "end": 210, "score": 1.0},
},
)One process. No distributed coordination. Each step:
- Sample a prompt from the priority-weighted dataloader (curriculum-gated).
- Generate a completion via vLLM (colocated on the same GPU, MERGE weight sync).
- Score the completion with the user-provided reward function.
- Segment the completion into regions (via segmenter or scored_spans).
- Estimate per-token per-quality advantages (SPO baseline + VPRM critic + ERIC).
- Compute clipped policy gradient loss with AlignedLossFrame.
- Backward through Triton fused logprobs (forward) + PyTorch chunked (backward).
- Update LoRA parameters via AdamW8bit with cosine schedule.
- Sync weights to vLLM via MERGE (merge_adapter / unmerge_adapter).
- Checkpoint every N steps with full state serialization.
Training model and vLLM share the same GPU. The MERGE strategy eliminates ~1.9 GB of duplicate weight storage:
- Before generation:
merge_adapter()bakes LoRA weights into base model weights. - Generate: vLLM reads the merged base weights directly. No separate LoRA buffer needed.
- After generation:
unmerge_adapter()lifts LoRA weights back out for training.
This works because Unsloth's fast_inference colocates training and vLLM in the same process, sharing the same nn.Module memory. The alternative (DIRECT_COPY) copies LoRA A/B into vLLM's stacked buffers each step, requiring vLLM to maintain its own copy of base weights — 1.4 GB of duplication plus 500 MB for the LoRA buffer.
MERGE requires a Params4bit monkey-patch for PEFT 4-bit compatibility. The weight_export.py module handles this.
Qwen3-1.7B, 4-bit, LoRA rank 32, MERGE strategy:
| Component | VRAM |
|---|---|
| Model body (NF4) | ~850 MB |
| lm_head + embeddings (bf16) | ~890 MB |
| vLLM (gpu_memory_utilization=0.25) | ~2.5 GB |
| LoRA adapters (rank 32) | ~48 MB |
| Optimizer (AdamW8bit) | ~96 MB |
| Training peak | ~11 GB |
| Steady state | ~9 GB |
| VRAM growth over 3000+ steps | 0.0 GB |
VRAM stability comes from torch.cuda.empty_cache() after merge/unmerge cycles. No engine recreation needed.
One completion per prompt. Every completion teaches.
Standard GRPO generates 8-16 completions per prompt, uses the group mean as baseline, and discards completions below the mean. SPO generates one completion and maintains a persistent EMA baseline per prompt per quality per step. The baseline tracks the model's running performance — advantage is the delta between the current reward and what the model usually achieves on that prompt for that quality.
Variance-aware baseline: when reward variance drops (the model is converging), the baseline learning rate slows to preserve the gradient signal. Aspiration gap: when the baseline matches the reward, a target-aware term beta * (reward - target) preserves shaped gradients through the plateau.
Two modes of credit assignment:
Segmenter mode: A segmenter function maps each token to a named region (e.g., KINETIC, POTENTIAL, HAMILTONIAN). Each region receives the advantage computed from its associated quality scores. Built-in segmenters: uniform, qwen3_xml, label, hamiltonian.
Span mode: The reward function returns scored_spans — character offsets where each quality expression was found. The spans module (spans.py) converts character offsets to per-token boolean masks using the tokenizer's decode mapping. Each token receives the advantage for the quality whose span contains it.
The VPRM critic (critic.py) learns a per-quality per-region value function. Architecture per quality: mean_pool(hidden_states[region]) -> Linear -> ReLU -> Linear -> ReLU -> Linear(1). Polyak-averaged target network for stable baselines.
The curriculum is a DAG, not a linear sequence.
Skill tree: Each node defines a skill (e.g., freefall, spring_only, gravity_spring). Nodes declare prerequisites, mastery thresholds, regression thresholds, review probabilities, and learnability thresholds. Prompts are matched to skills via metadata (e.g., match_metadata: {system: freefall}).
Advancement: A skill advances when mastery exceeds the threshold AND learnability p(1-p) drops below the learnability threshold — the model must be both accurate and stable. Regression detection triggers review scheduling.
Tier gating: Orthogonal to the skill tree, a 2D mastery matrix gates prompt difficulty tiers. Tutorial-tracked prompts bypass tier gating (the skill tree is the authority for those prompts).
Example from the Hamiltonian config:
freefall (root) ──> spring_only ──> gravity_spring ──> driven_oscillator (boss)
└──────> damped_spring ──────┘
Four-quadrant advantage modification based on (correct/wrong) x (confident/uncertain):
| Confident (low entropy) | Uncertain (high entropy) | |
|---|---|---|
| Correct | Q2: Leave alone (learned) | Q1: Reinforce (learning) |
| Wrong | Q3: Entropy boost (shake confidence) | Q4: Flag for hints (provide direction) |
ERIC uses a chunked lm_head entropy proxy — no attention matrices needed (Unsloth's fast_inference kernels block output_attentions). Entropy tracks model commitment: low-entropy tokens are "committed anchors." ERIC dampens positive advantage on committed correct tokens (they are already learned) and boosts entropy on committed wrong tokens (they need unlearning).
Combined with position-based causal weighting (entropy_position mode): earlier tokens get higher importance weight because downstream tokens condition on them.
All tensor shift operations for advantage-logprob alignment live in one place.
Policy gradient loss requires aligning advantages (computed on completion tokens) with logprobs (shifted by one for autoregressive prediction). This is the source of an entire class of off-by-one bugs. AlignedLossFrame centralizes the coordinate system: logprob space. Shape validation at construction time. A single object carries logprobs, ref_logprobs, advantages, loss_mask, and completion_mask — all guaranteed to be aligned.
Custom Triton kernel for the forward pass. One GPU launch replaces 16 Python-to-CUDA round-trips from the chunked lm_head + checkpoint path.
The kernel tiles along the vocab dimension (BLOCK_V=128), computing hidden[t] @ lm_head.T -> logsumexp -> gather at label[t] -> logprob[t] per sequence position. Peak VRAM: hidden_dim x BLOCK_V per thread block, not seq x vocab. Zero [seq, vocab] allocation.
Backward pass uses PyTorch's chunked path (autograd through lm_head chunks). Wrapped in torch.autograd.Function for clean integration with the training loop. Falls back to the PyTorch chunked path when Triton is unavailable or vocab_size % 128 != 0.
LoRA dropout: Bernoulli masks on LoRA A matrices during generation. Partially reverts the model to base model behavior, letting suppressed knowledge surface. Linear annealing over configurable steps. Based on NoisyGRPO (NeurIPS 2025).
Dr.GRPO: Removes length normalization and standard deviation normalization from the GRPO loss (arXiv:2503.20783). Unbiased gradients.
Region-specific KL: THINK regions get low KL (explore freely), FORMAT regions get high KL (lock structure), STEP regions get normal KL.
LLDS (Log-Likelihood Divergence Smoothing): Collapse prevention from NeMo RL. Penalizes divergence between current and reference policy logprobs.
GRPO-lambda eligibility traces: Lambda-return approximation for per-token credit assignment. Backward-accumulates advantages with decay factor, giving earlier tokens credit for downstream correctness.
Length penalty: Dynamic length control — penalizes length only when group accuracy exceeds a threshold. Prevents the model from gaming reward through verbosity.
Frontier amplification: Multiplies advantages for steps that block curriculum advancement. Mastered steps get weight 1.0, frontier (blocking) steps get amplified gradients.
Aspiration gap: When the SPO baseline matches the constant partial credit (e.g., the model always scores 0.5 on a quality), vanilla advantage goes to zero. The aspiration gap term beta * (reward - target) preserves gradient through converged baselines.
LoRA-Pro: Gradient adjustment so the low-rank update better approximates full fine-tuning (ICLR 2025, arXiv:2407.18242). Solves a Sylvester equation per LoRA layer after backward.
Gradient coherence monitoring: Temporal cosine similarity between consecutive steps' gradients (convergence signal), spatial cosine between adjacent layers (disagreement signal), LoRA weight norm tracking, turbulence detection.
QGRE is configured via a single YAML file. Top-level sections:
| Field | Type | Default | Description |
|---|---|---|---|
path |
str | "" |
HuggingFace model path (required) |
lora_rank |
int | 8 |
LoRA rank |
lora_alpha |
int | 16 |
LoRA alpha scaling |
load_in_4bit |
bool | true |
4-bit quantization (NF4) |
fast_inference |
bool | true |
Unsloth fast inference (colocated vLLM) |
gpu_memory_utilization |
float | 0.35 |
vLLM GPU memory fraction |
weight_sync_strategy |
str | "direct_copy" |
"direct_copy" or "merge" |
pad_token |
str | "" |
Pad token string (required, must not be EOS) |
pad_token_id |
int | -1 |
Pad token ID (required) |
lora_target_modules |
list[str] | Qwen3 defaults | LoRA target modules |
modules_to_save |
list[str] | ["lm_head"] |
Modules trained in full precision |
max_lora_rank |
int | 0 |
Max LoRA rank for vLLM (0 = auto) |
| Field | Type | Default | Description |
|---|---|---|---|
train_files |
list[str] | [] |
Parquet file paths (required) |
max_prompt_length |
int | 3200 |
Maximum prompt token length |
train_batch_size |
int | 16 |
Prompts per training step |
prompt_column |
str | "prompt" |
Column name for prompt text |
metadata_columns |
list[str] | ["ground_truth", "extra_info"] |
Metadata columns passed to reward function |
system_prompt_column |
str | None | None |
Column for separate system message |
difficulty_column |
str | None | None |
Column for difficulty-gated curriculum |
tier_order |
list[str] | None | None |
Tier progression order |
tier_advance_threshold |
float | 0.85 |
Mastery threshold for tier advancement |
tier_advance_quality_phase |
int | 3 |
Quality phase required for advancement |
initial_tiers |
list[str] | None | None |
Starting tiers |
| Field | Type | Default | Description |
|---|---|---|---|
temperature |
float | 0.7 |
Sampling temperature |
top_p |
float | 0.8 |
Nucleus sampling threshold |
top_k |
int | 20 |
Top-k sampling |
min_p |
float | 0.1 |
Minimum probability threshold |
max_tokens |
int | 4096 |
Maximum completion tokens |
repetition_penalty |
float | 1.0 |
vLLM repetition penalty (1.0 = disabled) |
stop_token_ids |
list[int] | [] |
Stop token IDs (required per model) |
max_logprobs |
int | 5 |
vLLM max logprobs for LLDS |
lora_dropout_rate |
float | 0.0 |
LoRA A dropout rate (0.15 recommended) |
lora_dropout_anneal_steps |
int | 500 |
Linear anneal to zero over N steps |
| Field | Type | Default | Description |
|---|---|---|---|
mode |
str | "spo" |
"spo" or "grpo" |
clip_ratio_low |
float | 0.2 |
PPO clip lower bound |
clip_ratio_high |
float | 0.28 |
PPO clip upper bound |
loss_type |
str | "grpo" |
"grpo" or "dr_grpo" (unbiased) |
llds_coef |
float | 0.05 |
LLDS collapse prevention coefficient |
step_qualities |
dict | None |
{step_num: [quality_names]} mapping |
segmenter |
str | "uniform" |
Segmenter: "uniform", "qwen3_xml", "label", "hamiltonian", or "module:function" |
use_fused_logprobs |
bool | true |
Chunked lm_head logprob computation |
use_triton_logprobs |
bool | true |
Triton kernel for forward pass |
advantage_scale |
float | 1.0 |
Scale factor for advantages (0.1 recommended for 1-3B) |
min_completion_tokens |
int | 0 |
Minimum tokens before negative floor (50 recommended) |
attention_constrained_advantage |
bool | false |
Enable ERIC |
attention_constraint_strength |
float | 1.0 |
ERIC dampening multiplier |
eric_mode |
str | "entropy_position" |
"entropy", "position", or "entropy_position" |
kl_think_multiplier |
float | 0.1 |
KL weight for THINK regions |
kl_format_multiplier |
float | 2.0 |
KL weight for FORMAT regions |
kl_step_multiplier |
float | 1.0 |
KL weight for STEP regions |
lambda_return |
float | 0.0 |
Eligibility trace decay (0 = off, 0.95 = typical) |
length_penalty_coef |
float | 0.0 |
Length penalty coefficient |
length_penalty_threshold |
float | 0.5 |
Accuracy threshold for length penalty |
frontier_amplification |
float | 2.0 |
Gradient boost for blocking steps |
| Field | Type | Default | Description |
|---|---|---|---|
lr |
float | 0.1 |
EMA baseline learning rate |
n |
int | 1 |
Completions per prompt |
aspiration_beta |
float | 0.5 |
Aspiration gap strength |
aspiration_target |
float | 0.8 |
Target reward for aspiration |
var_aware |
bool | true |
Variance-aware baseline |
var_threshold |
float | 0.01 |
Variance threshold for slowdown |
staleness_window |
int | 50 |
Steps before baseline decays to prior |
| Field | Type | Default | Description |
|---|---|---|---|
total_steps |
int | 800 |
Total training steps |
lr |
float | 5e-6 |
Optimizer learning rate |
warmup_steps |
int | 10 |
LR warmup steps |
lr_scheduler |
str | "cosine" |
LR scheduler type |
save_freq |
int | 50 |
Checkpoint frequency (0 = disabled) |
gradient_accumulation_steps |
int | 1 |
Gradient accumulation |
max_grad_norm |
float | 1.0 |
Gradient clipping |
mastery_threshold |
float | 0.8 |
Phase advancement threshold |
stagnation_timeout |
int | 200 |
Steps before stagnation detection |
embedding_lr_ratio |
float | 0.1 |
lm_head LR = base LR x this |
kv_cache_flush_freq |
int | 50 |
vLLM KV cache flush frequency (0 = disabled) |
quality_window_size |
int | 20 |
Rolling window for mastery tracking |
seed |
int | -1 |
Random seed (-1 = time-based) |
log_attention_patterns |
bool | false |
Log attention entropy and collapse |
| Field | Type | Default | Description |
|---|---|---|---|
enabled |
bool | false |
Enable VPRM learned critic |
intermediate_dim |
int | 128 |
MLP hidden dimension |
lr |
float | 1e-4 |
Critic learning rate |
clip_advantage |
float | 5.0 |
Per-quality advantage clipping |
spo_fallback_min_regions |
int | 2 |
Min regions for critic (else SPO fallback) |
polyak_tau |
float | 0.01 |
Target network update rate |
use_target_network |
bool | true |
Enable Polyak-averaged target |
| Field | Type | Default | Description |
|---|---|---|---|
enabled |
bool | false |
Enable ERIC/EGRS 4-quadrant system |
reward_threshold |
float | 0.5 |
Score above this = "correct" |
entropy_threshold |
float | 0.5 |
Normalized entropy below this = "confident" |
gate_temperature |
float | 0.1 |
Sigmoid temperature for soft gating |
exploration_weight |
float | 0.1 |
Entropy bonus for Q3 (confident+wrong) |
hint_enabled |
bool | true |
Enable hint injection for Q4 |
hint_extractor |
str | "none" |
"hamiltonian", "generic", or "none" |
mastery_threshold |
float | 0.8 |
Mastery at which hints stop |
| Field | Type | Default | Description |
|---|---|---|---|
enabled |
bool | false |
Enable skill tree tutorial |
post_mastery_behavior |
str | "review_only" |
"review_only", "pause", or "continue_all" |
untracked_always_active |
bool | true |
Prompts not in any skill are always available |
sequential_mastery |
bool | false |
Focus on one skill at a time |
skill_tree |
dict | {} |
Skill definitions (see below) |
Each skill in skill_tree:
| Field | Type | Default | Description |
|---|---|---|---|
prompts |
list[str] | [] |
Explicit prompt IDs |
match_metadata |
dict | None | None |
Match prompts by metadata column values |
prerequisites |
list[str] | [] |
Skills that must be mastered first |
mastery_threshold |
float | 0.8 |
Mastery score for advancement |
regression_threshold |
float | 0.6 |
Score below this triggers regression |
mastery_window |
int | 20 |
Rolling window for mastery score |
review_probability |
float | 0.15 |
Probability of review after mastery |
score_key |
str | None | None |
Quality key to track (None = overall reward) |
learnability_threshold |
float | 0.10 |
Advance when p(1-p) < this |
| Field | Type | Default | Description |
|---|---|---|---|
enabled |
bool | false |
Enable LoRA-Pro gradient adjustment |
beta1 |
float | 0.9 |
Adam beta1 for equivalent gradient |
beta2 |
float | 0.999 |
Adam beta2 for equivalent gradient |
grad_scale |
float | 1.0 |
Post-adjustment gradient multiplier |
grad_floor |
float | 1e-7 |
Minimum gradient norm |
| Field | Type | Default | Description |
|---|---|---|---|
mlflow_experiment |
str | "qgre-training" |
MLflow experiment name |
completion_dir |
str | "output/completions" |
JSONL completion log directory |
checkpoint_dir |
str | "output/checkpoints" |
Checkpoint directory |
log_freq |
int | 5 |
Progress table frequency |
grad_log_freq |
int | 10 |
Gradient flow log frequency |
The reward function contract. Return this from your reward function.
@dataclass(frozen=True)
class RewardResult:
reward: float # Overall reward in [0, 1]
scores: dict[str, float] # Per-quality scores {quality_name: score}
scored_spans: dict | None = None # Optional: character offsets for span-based assignmentscores keys must match the quality names in step_qualities. The engine uses scores for per-quality advantage estimation and scored_spans for per-token credit assignment.
Tracks curriculum state: current phase, mastery scores, step counters, skill tree state, tier state.
@dataclass
class GameState:
phase: int = 1
mastery: dict[str, float] # Per-quality rolling mastery scores
phase_step: int = 0
total_step: int = 0
skill_mastery: dict[str, float] # Per-skill mastery scores
active_skills: set[str] # Currently trainable skills
unlocked_tiers: set[str] # Currently accessible tiers
...The centralized coordinate system for advantage-logprob alignment.
@dataclass
class AlignedLossFrame:
logprobs: torch.Tensor # [batch, seq] in logprob space
ref_logprobs: torch.Tensor # [batch, seq] reference policy
advantages: torch.Tensor # [batch, seq] aligned to logprobs
loss_mask: torch.Tensor # [batch, seq] valid positions
completion_mask: torch.Tensor # [batch, seq] completion tokensConstructed with shape validation. All tensors guaranteed to share dimensions.
class GenerationBackend(Protocol):
def generate(self, input_ids, attention_mask, **kwargs) -> GenerationOutput: ...
def set_training_mode(self) -> None: ...
def set_inference_mode(self) -> None: ...
@property
def weight_exporter(self) -> WeightExporter: ...
@property
def weight_loader(self) -> WeightLoader: ...
@property
def model(self) -> nn.Module: ...
@property
def tokenizer(self) -> Any: ...The trainer interacts with generation through this protocol. UnslothBackend is the production implementation.
504 tests: 492 CPU, 12 GPU.
# All CPU tests
pytest tests/ -m "not gpu"
# GPU tests (requires CUDA)
pytest tests/ -m "gpu"
# Specific module
pytest tests/test_advantages.py -v
# With coverage
pytest tests/ --cov=qgre --cov-report=htmlTest coverage spans every module: advantages, attention constraints, checkpoint serialization, critic networks, data loading, EGRS integration, fused logprobs, gradient coherence, Hamiltonian reward, hardening regressions, hints, LLDS, logging, LoRA-Pro, LoRA verification, NeMo extracted, schema validation, segments, spans, sync state, trainer, Triton logprobs, tutorial skill tree, weight loader lifecycle, and wiring.
The Hamiltonian reward function (examples/hamiltonian/reward_fn_v2.py) demonstrates the design principle: correctness-only scoring.
No format scoring. RL teaches WHAT (mathematical correctness). SFT teaches HOW (formatting). Format scoring is the primary reward hacking vector — the model learns to produce well-formatted wrong answers (arXiv:2602.18037).
The reward function:
- Extracts all math expressions from the completion using format-agnostic line-by-line parsing. Strips LaTeX delimiters, splits on
=, handles both LaTeX and plain text. - Parses expressions using math-verify (HuggingFace's verification library).
- Verifies symbolic equivalence via sympy. Not string matching —
p^2/(2m)andp**2/2/mare the same expression. - Falls back to substitution: if the model writes a correct generic formula (e.g.,
p^2/(2m)) but the ground truth has specific values (e.g.,p^2/10), substitutes known constants from the problem metadata and checks equivalence. - Falls back to velocity form: recognizes kinetic energy written as
mv^2/2when ground truth uses momentum formp^2/(2m). - Returns
scored_spansmapping each quality to the character offsets where the expression was found.
Six qualities: q_kinetic, q_potential, q_hamiltonian, q_dqdt, q_dpdt, q_consistency.
qgre/
__init__.py Exports: RewardResult, GameState, Segmenter, segmenters
__main__.py CLI: python -m qgre train --config ... --reward ...
types.py RewardResult (frozen), GameState, AlignedLossFrame,
TrainingContext, CheckpointState, SyncLifecycle states
config.py All config dataclasses + YAML loader + validation
trainer.py QGRETrainer: training loop, Triton conditional path,
MERGE weight sync, ERIC integration
generation.py UnslothBackend: vLLM colocation, MERGE weight sync,
LoRA dropout, hint injection protocol
advantages.py SPO + Dr.GRPO + VPRM + phase gating + EGRS 4-quadrant +
frontier amplification + aspiration gap
data.py Parquet loading, tokenization, padding, batching,
priority sampling, tier gating
segments.py Segmenters: qwen3_xml, uniform, label, custom
spans.py Character-to-token mapping for scored_spans
critic.py VPRM per-region per-dimension learned critic with
Polyak-averaged target network
checkpoint.py Full state save/resume with schema versioning and migration
fused_logprobs.py Chunked lm_head projection (PyTorch path)
triton_logprobs.py Triton kernel + torch.autograd.Function wrapper
attention_bonds.py ERIC: entropy-regulated importance constraint,
confidence gating, position-based causal weighting
attention_analysis.py Attention entropy, collapse detection, fragmentation
gradient_coherence.py Temporal/spatial cosine, LoRA weight norm, turbulence
lora_dropout.py Bernoulli dropout on LoRA A during generation
lora_pro.py LoRA-Pro gradient adjustment (Sylvester equation solver)
lora_verify.py Weight hash verification after sync
weight_bus.py MERGE/DIRECT_COPY strategy dispatcher
weight_export.py PEFT-aware weight extraction + Params4bit patch
weight_load.py vLLM weight injection with SyncLifecycle state machine
sync_state.py Unified state machine for weight sync, dropout, cache
logging.py MLflow metrics + JSONL completion logs
schema.py Declarative schema validation for checkpoint fields
hints.py Hint extraction and injection for EGRS Q4 tokens
nemo_extracted/ Apache-2.0 code from NeMo RL v0.5.0:
kl.py KL divergence (k1/k2/k3) + masked_mean
llds.py Log-Likelihood Divergence Smoothing
logits.py selective_log_softmax, logprobs_from_logits
loss_functions.py ClippedPGLossFn + eligibility traces
examples/
hamiltonian/ Hamiltonian mechanics: config, reward fn, data generator,
system prompts, 121 sympy-verified problems
math/ Math reasoning example
hypergraph/ Hypergraph reasoning example
tests/ 504 tests (492 CPU + 12 GPU)
| Feature | Module | Description |
|---|---|---|
| SPO (Single-stream Policy Optimization) | advantages.py |
n=1 persistent EMA baseline per prompt per quality per step |
| VPRM (Verifiable Process Reward Mapping) | advantages.py, critic.py, spans.py |
Segmented per-token per-quality advantage with learned critic |
| Phase-gated curriculum with skill tree | types.py, data.py, config.py |
DAG-based prerequisite mastery with regression detection |
| ERIC (Entropy-Regulated Importance Constraint) | attention_bonds.py, advantages.py |
4-quadrant advantage modification via chunked entropy proxy |
| AlignedLossFrame | types.py, trainer.py |
Centralized tensor alignment with shape validation |
| Triton fused logprobs | triton_logprobs.py |
Forward via Triton kernel, backward via PyTorch, autograd wrapper |
| MERGE weight sync | weight_bus.py, weight_export.py, weight_load.py |
Share GPU memory between training and vLLM |
| LoRA dropout | lora_dropout.py |
Bernoulli masks on LoRA A for structured exploration |
| Dr.GRPO | trainer.py, config.py |
Unbiased gradients: no length or std normalization |
| Region-specific KL | trainer.py |
THINK/FORMAT/STEP regions with different KL weights |
| LLDS (collapse prevention) | nemo_extracted/llds.py |
Log-Likelihood Divergence Smoothing from NeMo RL |
| GRPO-lambda eligibility traces | nemo_extracted/loss_functions.py |
Lambda-return credit assignment |
| Aspiration gap | advantages.py |
Gradient preservation through converged baselines |
| Variance-aware baseline | advantages.py |
Slow baseline LR when reward variance drops |
| Frontier amplification | advantages.py |
Amplified gradient on curriculum-blocking steps |
| LoRA-Pro gradient adjustment | lora_pro.py |
Sylvester equation solver for better low-rank approximation |
| Gradient coherence monitoring | gradient_coherence.py |
Temporal cosine, LoRA weight norm, turbulence detection |
| math-verify reward verification | examples/hamiltonian/reward_fn_v2.py |
Battle-tested expression parsing + symbolic equivalence |
| Substitution fallback | examples/hamiltonian/reward_fn_v2.py |
Recognize correct generic formulas with free variables |
| Hint injection (EGRS Q4) | hints.py, generation.py |
Domain-specific hints for uncertain-wrong tokens |
| Fused logprobs (PyTorch) | fused_logprobs.py |
Chunked lm_head avoids full [seq, vocab] allocation |
| Checkpoint schema migration | checkpoint.py, schema.py |
Forward-compatible state serialization with validation |
| SyncState machine | sync_state.py |
Thread-safe unified state for weight sync lifecycle |
- Single GPU, single process. By design. Multi-GPU support is a non-goal.
- On-policy. Each step generates then trains. Generation-time logprobs available via vLLM (LLDS active).
- MERGE requires Params4bit patch. PEFT's
merge_adapter()does not natively support 4-bit quantized weights.weight_export.pypatchesParams4bit.datato return the dequantized tensor. - 16 GB VRAM ceiling with MERGE. 11 GB actual for Qwen3-1.7B rank 32. Larger models require larger GPUs or lower LoRA rank.
- Unsloth dependency. Fast inference, vLLM colocation, and 4-bit LoRA integration rely on Unsloth. The
GenerationBackendprotocol abstracts this, but no alternative backend is implemented. - Unix-only sympy timeout. The Hamiltonian reward function uses
SIGALRMfor sympy timeout. Windows requires an alternative timeout mechanism. - vLLM VRAM stability depends on
torch.cuda.empty_cache()after merge/unmerge cycles. Without it, fragmentation accumulates.
# Ruff (linter + formatter)
ruff check qgre/ tests/
ruff format qgre/ tests/
# Pyright (static type checking)
pyright qgre/
# Bandit (security)
bandit -c pyproject.toml -r qgre/The project enforces strict linting via Ruff with 50+ rule sets enabled, Pyright for static type checking, and Bandit for security analysis. Configuration lives in pyproject.toml.
- GRPO: Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning (2024). The group-relative baseline that QGRE's SPO extends to n=1.
- Dr.GRPO: Liu et al., Understanding R1-Zero-Like Training (2025). Unbiased GRPO — removes length and std normalization.
- SPO: Single-stream Policy Optimization. Tencent, ICLR 2026. Persistent EMA baseline, n=1 completions.
- GDPO: Group Decomposed Policy Optimization. NVIDIA, Jan 2026. Per-step advantage normalization.
- VPRMs: Verifiable Process Rewards. IBM Research, Jan 2026. Per-region credit assignment.
- Comedy of Estimators: Bengio et al., KL Regularization in RL Training. ICLR 2026. k1/k2/k3 KL estimator selection.
- Archer: Dual-Token KL Constraints. ICLR 2026. Region-specific KL (THINK vs FORMAT vs STEP).
- LLDS: Lazy Likelihood Displacement. UBC/Vector Institute, Dec 2025. Collapse prevention.
- DAPO: Decoupled Clip and Dynamic Sampling. ByteDance, 2025. Asymmetric dual clipping.
- NoisyGRPO: Noise Injection for RL Exploration. NeurIPS 2025. LoRA dropout basis.
- Scaf-GRPO: Scaffolded Progressive Training. Feb 2026. Guidance exemption period principle.
- LoRA-Pro: Wang et al., Are Low-Rank Adapters Properly Optimized? ICLR 2025. Sylvester equation for LoRA gradient adjustment.
- Unsloth: unslothai/unsloth. Fast LoRA training with vLLM colocation.
- NeMo RL: NVIDIA NeMo RL v0.5.0 (Apache-2.0). ClippedPGLossFn, KL divergence, LLDS, selective_log_softmax.
- math-verify: HuggingFace Math-Verify. Battle-tested expression parsing and equivalence verification.
- selective_log_softmax: Romero, TRL PR #2799. 37,000x memory reduction per position.
QGRE is open source. Contributions are welcome.
Good first contributions:
- New reward functions for new domains (chemistry, code generation, legal reasoning)
- New segmenters for different output formats
- MLX backend for Mac training (same algorithm layer, different tensor ops)
- Triton backward kernel (replace chunked PyTorch backward with tiled Triton)
- Dashboard for live training visualization (the metrics are all in MLflow)
Before contributing:
- Run
bash scripts/setup-dev.shto install pre-commit hooks (ruff, pyright, bandit) - Run
pytest tests/ -qto confirm all tests pass - New features need tests. New reward functions need at least 5 test cases with known ground truth.
The design philosophy: every module has one job. The engine composes them. If your contribution requires modifying 5 files, the abstraction boundary is probably wrong. Read the existing code first — the patterns are consistent.
File an issue or open a PR at github.com/torad-labs/qgre-engine.
Apache-2.0. See LICENSE.
qgre/nemo_extracted/ contains code from NVIDIA NeMo RL (Apache-2.0), modified for single-GPU single-process operation.
Built by Torad Labs.