[experimental] feat(isb1): mechanism_eval schema — registries + hard gate for compression quality #1
Conversation
Pull request overview
Adds an experimental ISB1 “mechanism_eval” schema to classify replay results by optimization mechanism/variant and enforce a hard gate: compression mechanisms cannot be labeled support_status=supported unless a registered quality benchmark has quality_eval_status=completed.
Changes:
- Introduces `utils/mechanism_eval.py` with env-driven mechanism fields, registry loading, and validation helpers.
- Extends result processing, gating, and the DB schema to carry/store mechanism metadata and enforce quality/speculative requirements.
- Adds registries, workflows/configs, and tests for the new mechanism_eval wiring.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `utils/mechanism_eval.py` | New mechanism_eval field catalog, registry loaders, validation, and "requires completed quality eval" predicate. |
| `utils/process_result_isb1.py` | Emits mechanism_eval fields on every processed row and attaches validation metadata. |
| `utils/gate_isb1.py` | Adds `mechanism_compression_quality` hard gate enforcing registry + quality-eval + speculative requirements. |
| `datasets/isb1/scripts/isb1_results_db.py` | Adds additive SQLite columns + migrations and CLI ingest flags for mechanism_eval fields. |
| `datasets/isb1/registry/mechanism_variant_registry.json` | Registry of allowed mechanism × variant pairs. |
| `datasets/isb1/registry/quality_eval_registry.json` | Registry of allowed quality-eval harness IDs. |
| `.github/workflows/run-isb1-mechanism-eval.yml` | New dispatch workflow intended to run mechanism-tagged ISB1 sweeps. |
| `.github/configs/isb1-mechanism-baseline.yaml` | New baseline mechanism config cells. |
| `.github/configs/isb1-mechanism-fp8-kv.yaml` | New FP8 KV quantization mechanism config cells (pending quality eval). |
| `utils/test_mechanism_eval.py` | Unit tests for field parsing, registry validation, and DB migration idempotency. |
| `utils/test_process_result_isb1_mechanism.py` | Subprocess integration tests for mechanism_eval fields in processed output. |
| `utils/test_gate_isb1.py` | Gate tests covering baseline, compression + quality requirements, and speculative requirements. |
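The additive, idempotent migration pattern described for `isb1_results_db.py` can be sketched as follows. This is an illustration, not the shipped module: the real file defines 16 columns, and the helper name here is hypothetical.

```python
import sqlite3

# Illustrative subset of the additive mechanism_eval columns
# (the real isb1_results_db.py defines 16 of them).
MECHANISM_COLUMNS = [
    ("mechanism", "TEXT"),
    ("mechanism_variant", "TEXT"),
    ("quality_eval_id", "TEXT"),
    ("quality_eval_status", "TEXT"),
]


def migrate(conn: sqlite3.Connection, table: str = "results") -> None:
    """Add any missing mechanism_eval columns; safe to run repeatedly."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for name, sql_type in MECHANISM_COLUMNS:
        if name not in existing:
            # New columns default to NULL, so legacy rows are untouched.
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
    conn.commit()
```

Because the PRAGMA check runs before each `ALTER TABLE`, calling `migrate()` on an already-upgraded database is a no-op, which is what lets legacy schemas upgrade in place on first connect.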
```yaml
canonical-model-id: deepseek_r1_0528
mechanism: baseline
mechanism-variant: none
```
These top-level `mechanism` / `mechanism-variant` keys are not part of the current ISB1 config schema consumed by `utils/matrix_logic/generate_sweep_configs.py isb1-sweep`. `load_isb1_config_files()` validates configs with `extra='forbid'`, so this file will fail matrix generation until the ISB1 master-config Pydantic model (and the downstream matrix-entry model) is extended to allow these fields and propagate them into the generated matrix entries.
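A minimal sketch of this `extra='forbid'` failure mode, with the real Pydantic model replaced by a plain allowed-key check (the field sets here are illustrative, not the actual ISB1 schema):

```python
# Sketch of extra='forbid' semantics: keys outside the declared schema are
# rejected until the schema is extended. Field names are illustrative.
BASE_FIELDS = {"canonical-model-id", "hardware", "workload-type"}
MECHANISM_FIELDS = {"mechanism", "mechanism-variant",
                    "quality-eval-id", "quality-eval-status"}


def unknown_keys(config: dict, allowed: set) -> list:
    """Return the config keys that extra='forbid' validation would reject."""
    return sorted(set(config) - allowed)


cfg = {"canonical-model-id": "deepseek_r1_0528", "mechanism": "baseline"}
print(unknown_keys(cfg, BASE_FIELDS))                     # current schema rejects 'mechanism'
print(unknown_keys(cfg, BASE_FIELDS | MECHANISM_FIELDS))  # extended schema accepts it
```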
```yaml
mechanism: kv_quantization
mechanism-variant: fp8_e4m3
compression-method: fp8_e4m3
compression-scope: kv_cache
quality-eval-id: ruler_v1
quality-eval-status: pending
```
This config introduces mechanism/quality-eval keys (`mechanism`, `mechanism-variant`, `compression-*`, `quality-eval-*`) that are currently rejected by the ISB1 config validator used by `generate_sweep_configs.py isb1-sweep` (it forbids unknown fields). As-is, the workflow's setup step will error during matrix generation; to make this runnable, the ISB1 master-config and matrix-entry Pydantic models and the isb1-sweep generator need to explicitly accept and forward these fields.
```yaml
offload-mode: ${{ matrix.config.offload-mode || '' }}
kv-cache-dtype: ${{ matrix.config.kv-cache-dtype || '' }}
disable-prefix-caching: ${{ matrix.config.disable-prefix-caching || '' }}
workload-type: ${{ matrix.config.workload-type || '' }}
```
The sweep job only forwards the standard ISB1 inputs into `benchmark-isb1-tmpl.yml`; none of the mechanism_eval fields from the config (e.g., `mechanism`, `mechanism-variant`, `quality-eval-id`/`status`, draft/speculative fields) are passed through as workflow inputs/env. As a result, `process_result_isb1.py` will see no `MECHANISM*`/`QUALITY_*` env vars and will default rows to `mechanism="baseline"`, making the mechanism_eval wiring ineffective in real runs. To fix, extend the template workflow's `workflow_call` inputs + env mapping and add corresponding `with:` entries here so these fields reach the runner environment.
Suggested change (keep the existing `workload-type` line and append the mechanism fields):

```yaml
workload-type: ${{ matrix.config.workload-type || '' }}
mechanism: ${{ matrix.config.mechanism || '' }}
mechanism-variant: ${{ matrix.config.mechanism-variant || '' }}
quality-eval-id: ${{ matrix.config.quality-eval-id || '' }}
quality-eval-status: ${{ matrix.config.quality-eval-status || '' }}
draft-model: ${{ matrix.config.draft-model || '' }}
draft-model-prefix: ${{ matrix.config.draft-model-prefix || '' }}
speculative-model: ${{ matrix.config.speculative-model || '' }}
speculative-model-prefix: ${{ matrix.config.speculative-model-prefix || '' }}
```
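The default-to-baseline failure mode described in this comment can be sketched as follows. The env var names (`MECHANISM`, `MECHANISM_VARIANT`, `QUALITY_EVAL_*`) and the helper are assumptions based on the comment's `MECHANISM*`/`QUALITY_*` description, not the exact code in `process_result_isb1.py`:

```python
import os


def read_mechanism_env(environ=None):
    """Illustrative fallback: with no forwarded vars, rows look like baseline runs."""
    env = os.environ if environ is None else environ
    return {
        "mechanism": env.get("MECHANISM") or "baseline",
        "mechanism_variant": env.get("MECHANISM_VARIANT") or None,
        "quality_eval_id": env.get("QUALITY_EVAL_ID") or None,
        "quality_eval_status": env.get("QUALITY_EVAL_STATUS") or None,
    }


# Without the workflow forwarding the config fields, every row degrades to:
print(read_mechanism_env({}))
```

This is why the missing `with:` plumbing is silent rather than fatal: the runner environment simply lacks the variables, and the processor falls back to its safe default.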
Summary
Stacked on top of PR #1032 (branch `isb1/kv-cache-stress-benchmark`). This PR adds only the mechanism_eval schema on top of PR SemiAnalysisAI#1032's ISB1 baseline — it does not re-include PR SemiAnalysisAI#1032's commits.

Single commit: `5c6b82f3 feat(isb1): add mechanism_eval schema — mechanism/variant/quality registries + hard gate`

What it adds
Extends the ISB1 replay result schema with a backward-compatible set of optional fields so every row declares which optimization technique it exercises (baseline, kv_quantization, kv_compression, compressed_attention, speculative_decoding) and which quality benchmark backs any lossy-technique claim. A hard gate then prevents a row from being labeled `support_status=supported` for a lossy technique unless a registered quality benchmark has completed.

All new fields default to NULL (`mechanism` defaults to `"baseline"`) so pre-existing rows, configs, and SQLite databases are unaffected until they opt into the mechanism_eval vocabulary. The database migration is idempotent; legacy schemas upgrade in place on first `connect_db()`.

New files
- `utils/mechanism_eval.py` — env-driven field catalog (14 fields), registry loaders, validation helpers, and the `row_requires_completed_quality_eval` predicate.
- `datasets/isb1/registry/mechanism_variant_registry.json` — 9 registered mechanism/variant pairs (baseline, fp8_e4m3, turboquant_class, kvtc_class, triattention_class, mtp, eagle3, medusa, dflash).
- `datasets/isb1/registry/quality_eval_registry.json` — 4 registered quality benchmarks (ruler_v1, longbench_v2, humaneval, math_500).
- `.github/configs/isb1-mechanism-baseline.yaml` — DSR1 (H100) and Qwen3.5 (B200) baseline cells.
- `.github/configs/isb1-mechanism-fp8-kv.yaml` — same two cells with FP8 E4M3 KV quantization, wired to `ruler_v1` and held at `reviewed_preview` until the RULER run completes (the gate blocks promotion to `supported` without it).
- `.github/workflows/run-isb1-mechanism-eval.yml` — dispatch workflow routing mechanism configs through `benchmark-isb1-tmpl`.
- `utils/test_mechanism_eval.py` (13 tests).
- `utils/test_process_result_isb1_mechanism.py` (3 subprocess tests).
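One plausible shape for `mechanism_variant_registry.json`, grouping the 9 registered pairs by mechanism. This is an illustration inferred from the description, not the shipped file; the actual layout may carry per-variant metadata:

```json
{
  "baseline": ["none"],
  "kv_quantization": ["fp8_e4m3", "turboquant_class"],
  "kv_compression": ["kvtc_class"],
  "compressed_attention": ["triattention_class"],
  "speculative_decoding": ["mtp", "eagle3", "medusa", "dflash"]
}
```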
- `utils/process_result_isb1.py` — emits 14 mechanism fields + a `mechanism_eval_validation` record attached to every processed row.
- `utils/gate_isb1.py` — new `mechanism_compression_quality` gate enforcing: (1) any non-baseline `mechanism_variant` must resolve in the registry; (2) `quality_eval_status ∈ {pending, completed, failed, not_required}`; (3) `supported` + compression mechanism ⇒ `quality_eval_status == completed` with a registered `quality_eval_id`; (4) `speculative_decoding` ⇒ `draft_model_id` + `speculative_acceptance_rate` present.
- `datasets/isb1/scripts/isb1_results_db.py` — 16 additive `ALTER TABLE` migrations plus matching `SCHEMA_SQL`, `INSERT_COLUMNS`, `GROUPABLE_COLUMNS`, and CLI ingest flags.
- `utils/test_gate_isb1.py` — 7 new mechanism-gate tests.

Tests
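The four gate rules above can be sketched as a single predicate. Function and field names here are hypothetical (the real implementation lives in `utils/gate_isb1.py`); registries are modeled as plain sets for illustration:

```python
# Hypothetical sketch of the mechanism_compression_quality gate logic.
LOSSY_MECHANISMS = {"kv_quantization", "kv_compression", "compressed_attention"}
VALID_QUALITY_STATUSES = {"pending", "completed", "failed", "not_required"}


def check_mechanism_gate(row, variant_registry, quality_registry):
    """Return a list of gate violations for one processed row."""
    errors = []
    mechanism = row.get("mechanism", "baseline")
    variant = row.get("mechanism_variant")
    status = row.get("quality_eval_status")

    # (1) Any non-baseline variant must resolve in the registry.
    if mechanism != "baseline" and (mechanism, variant) not in variant_registry:
        errors.append(f"unregistered variant: {mechanism}/{variant}")

    # (2) quality_eval_status must use the fixed vocabulary.
    if status is not None and status not in VALID_QUALITY_STATUSES:
        errors.append(f"invalid quality_eval_status: {status}")

    # (3) supported + lossy mechanism requires a completed, registered eval.
    if row.get("support_status") == "supported" and mechanism in LOSSY_MECHANISMS:
        if status != "completed" or row.get("quality_eval_id") not in quality_registry:
            errors.append("supported lossy row lacks completed registered quality eval")

    # (4) Speculative decoding requires draft-model metadata.
    if mechanism == "speculative_decoding":
        if not row.get("draft_model_id") or row.get("speculative_acceptance_rate") is None:
            errors.append("speculative row missing draft_model_id/acceptance_rate")

    return errors
```

This is the mechanism that keeps the fp8-kv config at `reviewed_preview`: rule (3) fires until a registered RULER run flips `quality_eval_status` to `completed`.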
Full suite: 285 passed, 2 pre-existing warnings.
References — public literature the registries are grounded in
KV cache quantization (`mechanism: kv_quantization`)

- `fp8_e4m3`: Micikevicius et al., "FP8 Formats for Deep Learning" (NVIDIA/Intel/Arm, 2022), arXiv:2209.05433. Defines the E4M3/E5M2 formats used by engine-native FP8 KV paths in vLLM and SGLang.
- `turboquant_class`: umbrella slot for Hadamard-rotated 4-bit KV schemes; Hooper et al., "KVQuant", 2024, arXiv:2401.18079, is a representative reference. Specific implementation citations travel with each submitted row via `mechanism_notes`.

KV cache compression (`mechanism: kv_compression`)

- `kvtc_class`: umbrella slot for tensor-codebook / product-quantization KV compressors. The class label reflects the architecture pattern; each submitted row cites its specific implementation.

Compressed attention (`mechanism: compressed_attention`)

- `triattention_class`: umbrella slot for sparse-/hybrid-attention variants that change the attention-computation surface rather than the stored KV format.

Speculative decoding (`mechanism: speculative_decoding`)

- `mtp`: Multi-Token Prediction head as used at scale in DeepSeek-V3 (DeepSeek-AI, 2024, arXiv:2412.19437).
- `eagle3`: EAGLE-family speculative decoding (Li et al., original EAGLE, 2024, arXiv:2401.15077; EAGLE-2 and EAGLE-3 are subsequent iterations of the same draft-model recipe).
- `medusa`: Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads", 2024, arXiv:2401.10774.
- `dflash`: umbrella slot for DeepFlash-style draft stacks.

Quality benchmarks (`quality_eval_registry.json`)

- `ruler_v1`: Hsieh et al., "RULER: What's the Real Context Size of Your Long-Context Language Models?" (NVIDIA, 2024), arXiv:2404.06654. Primary long-context retrieval signal for KV quantization and compression at 32K–1M.
- `longbench_v2`: Bai et al., "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks" (THUDM, 2024), arXiv:2412.15204. Complements RULER for reasoning-heavy long-context workloads.
- `humaneval`: Chen et al., "Evaluating Large Language Models Trained on Code" (OpenAI Codex paper, 2021), arXiv:2107.03374.
- `math_500`: 500-problem subset of the MATH dataset (Hendrycks et al., "Measuring Mathematical Problem Solving With the MATH Dataset", 2021, arXiv:2103.03874). Detects chain-of-thought degradation from aggressive KV quantization — the specific failure mode the hard gate is designed to catch.

Test plan
- `utils/test_mechanism_eval.py` — 13 tests
- `utils/test_gate_isb1.py` — 7 new + existing
- `utils/test_process_result_isb1_mechanism.py` — 3 tests
- Dispatch `run-isb1-mechanism-eval.yml` with `isb1-mechanism-baseline.yaml` and `isb1-mechanism-fp8-kv.yaml` — expect the hard gate to keep the row at `reviewed_preview` until a RULER eval is registered as `completed`