[experimental] feat(isb1): mechanism_eval schema — registries + hard gate for compression quality #1
Conversation
Pull request overview
Adds an experimental ISB1 “mechanism_eval” schema to classify replay results by optimization mechanism/variant and enforce a hard gate: compression mechanisms cannot be labeled support_status=supported unless a registered quality benchmark has quality_eval_status=completed.
Changes:
- Introduces `utils/mechanism_eval.py` with env-driven mechanism fields, registry loading, and validation helpers.
- Extends result processing, gating, and the DB schema to carry/store mechanism metadata and enforce quality/speculative requirements.
- Adds registries, workflows/configs, and tests for the new mechanism_eval wiring.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `utils/mechanism_eval.py` | New mechanism_eval field catalog, registry loaders, validation, and "requires completed quality eval" predicate. |
| `utils/process_result_isb1.py` | Emits mechanism_eval fields on every processed row and attaches validation metadata. |
| `utils/gate_isb1.py` | Adds `mechanism_compression_quality` hard gate enforcing registry + quality-eval + speculative requirements. |
| `datasets/isb1/scripts/isb1_results_db.py` | Adds additive SQLite columns + migrations and CLI ingest flags for mechanism_eval fields. |
| `datasets/isb1/registry/mechanism_variant_registry.json` | Registry of allowed mechanism × variant pairs. |
| `datasets/isb1/registry/quality_eval_registry.json` | Registry of allowed quality-eval harness IDs. |
| `.github/workflows/run-isb1-mechanism-eval.yml` | New dispatch workflow intended to run mechanism-tagged ISB1 sweeps. |
| `.github/configs/isb1-mechanism-baseline.yaml` | New baseline mechanism config cells. |
| `.github/configs/isb1-mechanism-fp8-kv.yaml` | New FP8 KV quantization mechanism config cells (pending quality eval). |
| `utils/test_mechanism_eval.py` | Unit tests for field parsing, registry validation, and DB migration idempotency. |
| `utils/test_process_result_isb1_mechanism.py` | Subprocess integration tests for mechanism_eval fields in processed output. |
| `utils/test_gate_isb1.py` | Gate tests covering baseline, compression + quality requirements, and speculative requirements. |
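The additive, idempotent migration pattern described for `isb1_results_db.py` can be sketched as follows. This is an illustration, not the shipped module: the real file defines 16 columns, and the helper name here is hypothetical.

```python
import sqlite3

# Illustrative subset of the additive mechanism_eval columns
# (the real isb1_results_db.py defines 16 of them).
MECHANISM_COLUMNS = [
    ("mechanism", "TEXT"),
    ("mechanism_variant", "TEXT"),
    ("quality_eval_id", "TEXT"),
    ("quality_eval_status", "TEXT"),
]


def migrate(conn: sqlite3.Connection, table: str = "results") -> None:
    """Add any missing mechanism_eval columns; safe to run repeatedly."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for name, sql_type in MECHANISM_COLUMNS:
        if name not in existing:
            # New columns default to NULL, so legacy rows are untouched.
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
    conn.commit()
```

Because the PRAGMA check runs before each `ALTER TABLE`, calling `migrate()` on an already-upgraded database is a no-op, which is what lets legacy schemas upgrade in place on first connect.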
```yaml
canonical-model-id: deepseek_r1_0528
mechanism: baseline
mechanism-variant: none
```
These top-level `mechanism` / `mechanism-variant` keys are not part of the current ISB1 config schema consumed by `utils/matrix_logic/generate_sweep_configs.py isb1-sweep`. `load_isb1_config_files()` validates configs with `extra='forbid'`, so this file will fail matrix generation until the ISB1 master-config Pydantic model (and the downstream matrix-entry model) is extended to allow these fields and propagate them into the generated matrix entries.
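A minimal sketch of this `extra='forbid'` failure mode, with the real Pydantic model replaced by a plain allowed-key check (the field sets here are illustrative, not the actual ISB1 schema):

```python
# Sketch of extra='forbid' semantics: keys outside the declared schema are
# rejected until the schema is extended. Field names are illustrative.
BASE_FIELDS = {"canonical-model-id", "hardware", "workload-type"}
MECHANISM_FIELDS = {"mechanism", "mechanism-variant",
                    "quality-eval-id", "quality-eval-status"}


def unknown_keys(config: dict, allowed: set) -> list:
    """Return the config keys that extra='forbid' validation would reject."""
    return sorted(set(config) - allowed)


cfg = {"canonical-model-id": "deepseek_r1_0528", "mechanism": "baseline"}
print(unknown_keys(cfg, BASE_FIELDS))                     # current schema rejects 'mechanism'
print(unknown_keys(cfg, BASE_FIELDS | MECHANISM_FIELDS))  # extended schema accepts it
```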
```yaml
mechanism: kv_quantization
mechanism-variant: fp8_e4m3
compression-method: fp8_e4m3
compression-scope: kv_cache
quality-eval-id: ruler_v1
quality-eval-status: pending
```
This config introduces mechanism/quality-eval keys (`mechanism`, `mechanism-variant`, `compression-*`, `quality-eval-*`) that are currently rejected by the ISB1 config validator used by `generate_sweep_configs.py isb1-sweep` (it forbids unknown fields). As-is, the workflow's setup step will error during matrix generation; to make this runnable, the ISB1 master-config and matrix-entry Pydantic models and the isb1-sweep generator need to explicitly accept and forward these fields.
```yaml
offload-mode: ${{ matrix.config.offload-mode || '' }}
kv-cache-dtype: ${{ matrix.config.kv-cache-dtype || '' }}
disable-prefix-caching: ${{ matrix.config.disable-prefix-caching || '' }}
workload-type: ${{ matrix.config.workload-type || '' }}
```
The sweep job only forwards the standard ISB1 inputs into `benchmark-isb1-tmpl.yml`; none of the mechanism_eval fields from the config (e.g., `mechanism`, `mechanism-variant`, `quality-eval-id`/`status`, draft/speculative fields) are passed through as workflow inputs/env. As a result, `process_result_isb1.py` will see no `MECHANISM*`/`QUALITY_*` env vars and will default rows to `mechanism="baseline"`, making the mechanism_eval wiring ineffective in real runs. To fix, extend the template workflow's `workflow_call` inputs + env mapping and add corresponding `with:` entries here so these fields reach the runner environment.
Suggested change (keep the existing `workload-type` line and append the mechanism fields):

```yaml
workload-type: ${{ matrix.config.workload-type || '' }}
mechanism: ${{ matrix.config.mechanism || '' }}
mechanism-variant: ${{ matrix.config.mechanism-variant || '' }}
quality-eval-id: ${{ matrix.config.quality-eval-id || '' }}
quality-eval-status: ${{ matrix.config.quality-eval-status || '' }}
draft-model: ${{ matrix.config.draft-model || '' }}
draft-model-prefix: ${{ matrix.config.draft-model-prefix || '' }}
speculative-model: ${{ matrix.config.speculative-model || '' }}
speculative-model-prefix: ${{ matrix.config.speculative-model-prefix || '' }}
```
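The default-to-baseline failure mode described in this comment can be sketched as follows. The env var names (`MECHANISM`, `MECHANISM_VARIANT`, `QUALITY_EVAL_*`) and the helper are assumptions based on the comment's `MECHANISM*`/`QUALITY_*` description, not the exact code in `process_result_isb1.py`:

```python
import os


def read_mechanism_env(environ=None):
    """Illustrative fallback: with no forwarded vars, rows look like baseline runs."""
    env = os.environ if environ is None else environ
    return {
        "mechanism": env.get("MECHANISM") or "baseline",
        "mechanism_variant": env.get("MECHANISM_VARIANT") or None,
        "quality_eval_id": env.get("QUALITY_EVAL_ID") or None,
        "quality_eval_status": env.get("QUALITY_EVAL_STATUS") or None,
    }


# Without the workflow forwarding the config fields, every row degrades to:
print(read_mechanism_env({}))
```

This is why the missing `with:` plumbing is silent rather than fatal: the runner environment simply lacks the variables, and the processor falls back to its safe default.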
Summary
Stacked on top of PR #1032 (branch `isb1/kv-cache-stress-benchmark`). This PR adds only the mechanism_eval schema on top of PR SemiAnalysisAI#1032's ISB1 baseline — it does not re-include PR SemiAnalysisAI#1032's commits.

Single commit: `5c6b82f3 feat(isb1): add mechanism_eval schema — mechanism/variant/quality registries + hard gate`

What it adds
Extends the ISB1 replay result schema with a backward-compatible set of optional fields so every row declares which optimization technique it exercises (baseline, kv_quantization, kv_compression, compressed_attention, speculative_decoding) and which quality benchmark backs any lossy-technique claim. A hard gate then prevents a row from being labeled `support_status=supported` for a lossy technique unless a registered quality benchmark has completed.

All new fields default to NULL (`mechanism` defaults to `"baseline"`) so pre-existing rows, configs, and SQLite databases are unaffected until they opt into the mechanism_eval vocabulary. The database migration is idempotent; legacy schemas upgrade in place on first `connect_db()`.

New files
- `utils/mechanism_eval.py` — env-driven field catalog (14 fields), registry loaders, validation helpers, and the `row_requires_completed_quality_eval` predicate.
- `datasets/isb1/registry/mechanism_variant_registry.json` — 9 registered mechanism/variant pairs (baseline, fp8_e4m3, turboquant_class, kvtc_class, triattention_class, mtp, eagle3, medusa, dflash).
- `datasets/isb1/registry/quality_eval_registry.json` — 4 registered quality benchmarks (ruler_v1, longbench_v2, humaneval, math_500).
- `.github/configs/isb1-mechanism-baseline.yaml` — DSR1 (H100) and Qwen3.5 (B200) baseline cells.
- `.github/configs/isb1-mechanism-fp8-kv.yaml` — same two cells with FP8 E4M3 KV quantization, wired to `ruler_v1` and held at `reviewed_preview` until the RULER run completes (the gate blocks promotion to `supported` without it).
- `.github/workflows/run-isb1-mechanism-eval.yml` — dispatch workflow routing mechanism configs through `benchmark-isb1-tmpl`.
- `utils/test_mechanism_eval.py` (13 tests).
- `utils/test_process_result_isb1_mechanism.py` (3 subprocess tests).
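One plausible shape for `mechanism_variant_registry.json`, grouping the 9 registered pairs by mechanism. This is an illustration inferred from the description, not the shipped file; the actual layout may carry per-variant metadata:

```json
{
  "baseline": ["none"],
  "kv_quantization": ["fp8_e4m3", "turboquant_class"],
  "kv_compression": ["kvtc_class"],
  "compressed_attention": ["triattention_class"],
  "speculative_decoding": ["mtp", "eagle3", "medusa", "dflash"]
}
```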
- `utils/process_result_isb1.py` — emits 14 mechanism fields + a `mechanism_eval_validation` record attached to every processed row.
- `utils/gate_isb1.py` — new `mechanism_compression_quality` gate enforcing: (1) any non-baseline `mechanism_variant` must resolve in the registry; (2) `quality_eval_status ∈ {pending, completed, failed, not_required}`; (3) `supported` + compression mechanism ⇒ `quality_eval_status == completed` with a registered `quality_eval_id`; (4) `speculative_decoding` ⇒ `draft_model_id` + `speculative_acceptance_rate` present.
- `datasets/isb1/scripts/isb1_results_db.py` — 16 additive `ALTER TABLE` migrations plus matching `SCHEMA_SQL`, `INSERT_COLUMNS`, `GROUPABLE_COLUMNS`, and CLI ingest flags.
- `utils/test_gate_isb1.py` — 7 new mechanism-gate tests.

Tests
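The four gate rules above can be sketched as a single predicate. Function and field names here are hypothetical (the real implementation lives in `utils/gate_isb1.py`); registries are modeled as plain sets for illustration:

```python
# Hypothetical sketch of the mechanism_compression_quality gate logic.
LOSSY_MECHANISMS = {"kv_quantization", "kv_compression", "compressed_attention"}
VALID_QUALITY_STATUSES = {"pending", "completed", "failed", "not_required"}


def check_mechanism_gate(row, variant_registry, quality_registry):
    """Return a list of gate violations for one processed row."""
    errors = []
    mechanism = row.get("mechanism", "baseline")
    variant = row.get("mechanism_variant")
    status = row.get("quality_eval_status")

    # (1) Any non-baseline variant must resolve in the registry.
    if mechanism != "baseline" and (mechanism, variant) not in variant_registry:
        errors.append(f"unregistered variant: {mechanism}/{variant}")

    # (2) quality_eval_status must use the fixed vocabulary.
    if status is not None and status not in VALID_QUALITY_STATUSES:
        errors.append(f"invalid quality_eval_status: {status}")

    # (3) supported + lossy mechanism requires a completed, registered eval.
    if row.get("support_status") == "supported" and mechanism in LOSSY_MECHANISMS:
        if status != "completed" or row.get("quality_eval_id") not in quality_registry:
            errors.append("supported lossy row lacks completed registered quality eval")

    # (4) Speculative decoding requires draft-model metadata.
    if mechanism == "speculative_decoding":
        if not row.get("draft_model_id") or row.get("speculative_acceptance_rate") is None:
            errors.append("speculative row missing draft_model_id/acceptance_rate")

    return errors
```

This is the mechanism that keeps the fp8-kv config at `reviewed_preview`: rule (3) fires until a registered RULER run flips `quality_eval_status` to `completed`.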
Full suite: 285 passed, 2 pre-existing warnings.
References — public literature the registries are grounded in
KV cache quantization (`mechanism: kv_quantization`)

- `fp8_e4m3`: Micikevicius et al., "FP8 Formats for Deep Learning" (NVIDIA/Intel/Arm, 2022), arXiv:2209.05433. Defines the E4M3/E5M2 formats used by engine-native FP8 KV paths in vLLM and SGLang.
- `turboquant_class`: umbrella slot for Hadamard-rotated 4-bit KV schemes; Hooper et al., "KVQuant", 2024, arXiv:2401.18079, is a representative reference. Specific implementation citations travel with each submitted row via `mechanism_notes`.

KV cache compression (`mechanism: kv_compression`)

- `kvtc_class`: umbrella slot for tensor-codebook / product-quantization KV compressors. The class label reflects the architecture pattern; each submitted row cites its specific implementation.

Compressed attention (`mechanism: compressed_attention`)

- `triattention_class`: umbrella slot for sparse-/hybrid-attention variants that change the attention-computation surface rather than the stored KV format.

Speculative decoding (`mechanism: speculative_decoding`)

- `mtp`: Multi-Token Prediction head as used at scale in DeepSeek-V3 (DeepSeek-AI, 2024, arXiv:2412.19437).
- `eagle3`: EAGLE-family speculative decoding (Li et al., original EAGLE, 2024, arXiv:2401.15077; EAGLE-2 and EAGLE-3 are subsequent iterations of the same draft-model recipe).
- `medusa`: Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads", 2024, arXiv:2401.10774.
- `dflash`: umbrella slot for DeepFlash-style draft stacks.

Quality benchmarks (`quality_eval_registry.json`)

- `ruler_v1`: Hsieh et al., "RULER: What's the Real Context Size of Your Long-Context Language Models?" (NVIDIA, 2024), arXiv:2404.06654. Primary long-context retrieval signal for KV quantization and compression at 32K–1M.
- `longbench_v2`: Bai et al., "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks" (THUDM, 2024), arXiv:2412.15204. Complements RULER for reasoning-heavy long-context workloads.
- `humaneval`: Chen et al., "Evaluating Large Language Models Trained on Code" (OpenAI Codex paper, 2021), arXiv:2107.03374.
- `math_500`: 500-problem subset of the MATH dataset (Hendrycks et al., "Measuring Mathematical Problem Solving With the MATH Dataset", 2021, arXiv:2103.03874). Detects chain-of-thought degradation from aggressive KV quantization — the specific failure mode the hard gate is designed to catch.

Test plan
- `utils/test_mechanism_eval.py` — 13 tests
- `utils/test_gate_isb1.py` — 7 new + existing
- `utils/test_process_result_isb1_mechanism.py` — 3 tests
- Dispatch `run-isb1-mechanism-eval.yml` with `isb1-mechanism-baseline.yaml` and `isb1-mechanism-fp8-kv.yaml` — expect the hard gate to keep the row at `reviewed_preview` until a RULER eval is registered as `completed`