[experimental] feat(isb1): mechanism_eval schema — registries + hard gate for compression quality#1

Open
OCWC22 wants to merge 1 commit into isb1/kv-cache-stress-benchmark from isb1/mechanism-eval-schema
Conversation

OCWC22 (Owner) commented Apr 17, 2026

Summary

Stacked on top of PR SemiAnalysisAI#1032 (branch isb1/kv-cache-stress-benchmark). This PR adds only the mechanism_eval schema on top of that PR's ISB1 baseline; it does not re-include its commits.

Single commit: 5c6b82f3 feat(isb1): add mechanism_eval schema — mechanism/variant/quality registries + hard gate

What it adds

Extends the ISB1 replay result schema with a backward-compatible set of optional fields so every row declares which optimization technique it exercises (baseline, kv_quantization, kv_compression, compressed_attention, speculative_decoding) and which quality benchmark backs any lossy-technique claim. A hard gate then prevents a row from being labeled support_status=supported for a lossy technique unless a registered quality benchmark has completed.

All new fields default to NULL (mechanism defaults to "baseline") so pre-existing rows, configs, and SQLite databases are unaffected until they opt into the mechanism_eval vocabulary. The database migration is idempotent; legacy schemas upgrade in place on first connect_db().
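The idempotent-migration behavior described above can be sketched with Python's sqlite3 module. This is a hypothetical illustration, not the PR's actual code: the real migration adds 16 columns in datasets/isb1/scripts/isb1_results_db.py, and the column names below are stand-ins.

```python
import sqlite3

# Illustrative subset of the mechanism_eval columns (names are assumptions).
MECHANISM_COLUMNS = {
    "mechanism": "TEXT DEFAULT 'baseline'",
    "mechanism_variant": "TEXT",
    "quality_eval_id": "TEXT",
    "quality_eval_status": "TEXT",
}

def migrate(conn: sqlite3.Connection) -> None:
    """Additively add any missing mechanism columns; safe to run repeatedly."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(results)")}
    for name, decl in MECHANISM_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE results ADD COLUMN {name} {decl}")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY)")  # legacy schema
migrate(conn)
migrate(conn)  # second run is a no-op, demonstrating idempotency
cols = {row[1] for row in conn.execute("PRAGMA table_info(results)")}
```

Guarding each ALTER TABLE behind a PRAGMA table_info check is what makes the upgrade safe to run on every connect.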

New files

  • utils/mechanism_eval.py — env-driven field catalog (14 fields), registry loaders, validation helpers, and the row_requires_completed_quality_eval predicate.
  • datasets/isb1/registry/mechanism_variant_registry.json — 9 registered mechanism/variant pairs (baseline, fp8_e4m3, turboquant_class, kvtc_class, triattention_class, mtp, eagle3, medusa, dflash).
  • datasets/isb1/registry/quality_eval_registry.json — 4 registered quality benchmarks (ruler_v1, longbench_v2, humaneval, math_500).
  • .github/configs/isb1-mechanism-baseline.yaml — DSR1 (H100) and Qwen3.5 (B200) baseline cells.
  • .github/configs/isb1-mechanism-fp8-kv.yaml — same two cells with FP8 E4M3 KV quantization, wired to ruler_v1 and held at reviewed_preview until the RULER run completes (the gate blocks promotion to supported without it).
  • .github/workflows/run-isb1-mechanism-eval.yml — dispatch workflow routing mechanism configs through benchmark-isb1-tmpl.
  • utils/test_mechanism_eval.py (13 tests).
  • utils/test_process_result_isb1_mechanism.py (3 subprocess tests).
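The "env-driven field catalog" idea in utils/mechanism_eval.py can be sketched as follows. This is a minimal stand-in under assumed variable names; the real catalog defines 14 fields:

```python
import os

# Hypothetical subset of the field catalog; real names live in utils/mechanism_eval.py.
FIELD_CATALOG = ("MECHANISM", "MECHANISM_VARIANT", "QUALITY_EVAL_ID", "QUALITY_EVAL_STATUS")

def read_mechanism_fields(env=os.environ) -> dict:
    """Collect mechanism_eval fields from the environment; unset or empty
    fields stay None, except mechanism, which defaults to 'baseline'."""
    row = {name.lower(): env.get(name) or None for name in FIELD_CATALOG}
    if row["mechanism"] is None:
        row["mechanism"] = "baseline"
    return row

baseline = read_mechanism_fields(env={})
fp8 = read_mechanism_fields(env={"MECHANISM": "kv_quantization",
                                 "MECHANISM_VARIANT": "fp8_e4m3"})
```

The defaulting rule mirrors the backward-compatibility contract above: a row that never opts in looks exactly like a baseline row.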

Extended files

  • utils/process_result_isb1.py — emits 14 mechanism fields + a mechanism_eval_validation record attached to every processed row.
  • utils/gate_isb1.py — new mechanism_compression_quality gate enforcing:
    1. any non-baseline mechanism_variant must resolve in the registry;
    2. quality_eval_status ∈ {pending, completed, failed, not_required};
    3. supported + compression mechanism ⇒ quality_eval_status == completed with a registered quality_eval_id;
    4. speculative_decoding ⇒ draft_model_id + speculative_acceptance_rate present.
  • datasets/isb1/scripts/isb1_results_db.py — 16 additive ALTER TABLE migrations plus matching SCHEMA_SQL, INSERT_COLUMNS, GROUPABLE_COLUMNS, and CLI ingest flags.
  • utils/test_gate_isb1.py — 7 new mechanism-gate tests.
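The four gate rules above can be distilled into a small predicate. This is a hypothetical sketch, not the actual utils/gate_isb1.py implementation:

```python
# Assumed vocabulary; the real gate reads the JSON registries in datasets/isb1/registry/.
COMPRESSION_MECHANISMS = {"kv_quantization", "kv_compression", "compressed_attention"}
VALID_STATUSES = {"pending", "completed", "failed", "not_required"}

def gate_violations(row: dict, variant_registry: set, quality_registry: set) -> list:
    """Return the list of mechanism_compression_quality rule violations."""
    errors = []
    variant = row.get("mechanism_variant")
    if row.get("mechanism") != "baseline" and variant not in variant_registry:
        errors.append(f"unregistered variant: {variant}")             # rule 1
    status = row.get("quality_eval_status")
    if status is not None and status not in VALID_STATUSES:
        errors.append(f"invalid quality_eval_status: {status}")       # rule 2
    if (row.get("support_status") == "supported"
            and row.get("mechanism") in COMPRESSION_MECHANISMS
            and not (status == "completed"
                     and row.get("quality_eval_id") in quality_registry)):
        errors.append("supported compression row lacks completed quality eval")  # rule 3
    if row.get("mechanism") == "speculative_decoding" and not (
            row.get("draft_model_id") and row.get("speculative_acceptance_rate")):
        errors.append("speculative row missing draft fields")         # rule 4
    return errors

ok = gate_violations(
    {"mechanism": "kv_quantization", "mechanism_variant": "fp8_e4m3",
     "support_status": "supported", "quality_eval_status": "completed",
     "quality_eval_id": "ruler_v1"},
    variant_registry={"fp8_e4m3"}, quality_registry={"ruler_v1"})
blocked = gate_violations(
    {"mechanism": "kv_quantization", "mechanism_variant": "fp8_e4m3",
     "support_status": "supported", "quality_eval_status": "pending",
     "quality_eval_id": "ruler_v1"},
    variant_registry={"fp8_e4m3"}, quality_registry={"ruler_v1"})
```

The blocked case is exactly the fp8-kv config scenario: a supported label on a lossy technique is refused until the quality eval completes.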

Tests

Full suite: 285 passed, 2 pre-existing warnings.

References — public literature the registries are grounded in

KV cache quantization (mechanism: kv_quantization)

  • fp8_e4m3: Micikevicius et al., "FP8 Formats for Deep Learning" (NVIDIA/Intel/Arm, 2022), arXiv:2209.05433. Defines the E4M3/E5M2 formats used by engine-native FP8 KV paths in vLLM and SGLang.
  • turboquant_class: umbrella slot for Hadamard-rotated 4-bit KV schemes; Hooper et al., "KVQuant", 2024, arXiv:2401.18079, is a representative reference. Specific implementation citations travel with each submitted row via mechanism_notes.

KV cache compression (mechanism: kv_compression)

  • kvtc_class: umbrella slot for tensor-codebook / product-quantization KV compressors. The class label reflects the architecture pattern; each submitted row cites its specific implementation.

Compressed attention (mechanism: compressed_attention)

  • triattention_class: umbrella slot for sparse-/hybrid-attention variants that change the attention-computation surface rather than the stored KV format.

Speculative decoding (mechanism: speculative_decoding)

  • mtp: Multi-Token Prediction head as used at scale in DeepSeek-V3 (DeepSeek-AI, 2024, arXiv:2412.19437).
  • eagle3: EAGLE-family speculative decoding (Li et al., original EAGLE, 2024, arXiv:2401.15077; EAGLE-2 and EAGLE-3 are subsequent iterations of the same draft-model recipe).
  • medusa: Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads", 2024, arXiv:2401.10774.
  • dflash: umbrella slot for DeepFlash-style draft stacks.

Quality benchmarks (quality_eval_registry.json)

  • ruler_v1: Hsieh et al., "RULER: What's the Real Context Size of Your Long-Context Language Models?" (NVIDIA, 2024), arXiv:2404.06654. Primary long-context retrieval signal for KV quantization and compression at 32K–1M.
  • longbench_v2: Bai et al., "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks" (THUDM, 2024), arXiv:2412.15204. Complements RULER for reasoning-heavy long-context workloads.
  • humaneval: Chen et al., "Evaluating Large Language Models Trained on Code" (OpenAI Codex paper, 2021), arXiv:2107.03374.
  • math_500: 500-problem subset of the MATH dataset (Hendrycks et al., "Measuring Mathematical Problem Solving With the MATH Dataset", 2021, arXiv:2103.03874). Detects chain-of-thought degradation from aggressive KV quantization — the specific failure mode the hard gate is designed to catch.

Test plan

  • Unit tests pass (utils/test_mechanism_eval.py — 13 tests)
  • Gate tests pass (utils/test_gate_isb1.py — 7 new + existing)
  • Subprocess integration tests pass (utils/test_process_result_isb1_mechanism.py — 3 tests)
  • Full suite: 285/285 passing
  • Land PR SemiAnalysisAI/InferenceX#1032 ([isb1] add converted trace corpus + kv-cache-tester contract helpers) first so this PR's base exists upstream
  • Dry-run dispatch of run-isb1-mechanism-eval.yml with isb1-mechanism-baseline.yaml
  • Dry-run dispatch with isb1-mechanism-fp8-kv.yaml — expect the hard gate to keep the row at reviewed_preview until a RULER eval is registered as completed

Copilot AI review requested due to automatic review settings April 17, 2026 07:32
Copilot AI left a comment


Pull request overview

Adds an experimental ISB1 “mechanism_eval” schema to classify replay results by optimization mechanism/variant and enforce a hard gate: compression mechanisms cannot be labeled support_status=supported unless a registered quality benchmark has quality_eval_status=completed.

Changes:

  • Introduces utils/mechanism_eval.py with env-driven mechanism fields, registry loading, and validation helpers.
  • Extends result processing + gating + DB schema to carry/store mechanism metadata and enforce quality/speculative requirements.
  • Adds registries, workflows/configs, and tests for the new mechanism_eval wiring.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
utils/mechanism_eval.py New mechanism_eval field catalog, registry loaders, validation, and “requires completed quality eval” predicate.
utils/process_result_isb1.py Emits mechanism_eval fields on every processed row and attaches validation metadata.
utils/gate_isb1.py Adds mechanism_compression_quality hard gate enforcing registry + quality-eval + speculative requirements.
datasets/isb1/scripts/isb1_results_db.py Adds additive SQLite columns + migrations and CLI ingest flags for mechanism_eval fields.
datasets/isb1/registry/mechanism_variant_registry.json Registry of allowed mechanism × variant pairs.
datasets/isb1/registry/quality_eval_registry.json Registry of allowed quality-eval harness IDs.
.github/workflows/run-isb1-mechanism-eval.yml New dispatch workflow intended to run mechanism-tagged ISB1 sweeps.
.github/configs/isb1-mechanism-baseline.yaml New baseline mechanism config cells.
.github/configs/isb1-mechanism-fp8-kv.yaml New FP8 KV quantization mechanism config cells (pending quality eval).
utils/test_mechanism_eval.py Unit tests for field parsing, registry validation, and DB migration idempotency.
utils/test_process_result_isb1_mechanism.py Subprocess integration tests for mechanism_eval fields in processed output.
utils/test_gate_isb1.py Gate tests covering baseline, compression+quality requirements, and speculative requirements.


Comment on lines +24 to +26
canonical-model-id: deepseek_r1_0528
mechanism: baseline
mechanism-variant: none
Copilot AI Apr 17, 2026

These top-level mechanism / mechanism-variant keys are not part of the current ISB1 config schema consumed by utils/matrix_logic/generate_sweep_configs.py isb1-sweep. load_isb1_config_files() validates configs with extra='forbid', so this file will fail matrix generation until the ISB1 master-config Pydantic model (and downstream matrix entry model) is extended to allow these fields and propagate them into the generated matrix entries.
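The rejection the comment describes (a forbid-extra validator raising on unknown config keys) can be illustrated with a plain-Python stand-in for the Pydantic behavior; the field names here are hypothetical, not the real ISB1 schema:

```python
# Stand-in for the ISB1 master-config field set (illustrative names only).
ISB1_CONFIG_FIELDS = {"canonical-model-id", "gpu", "engine"}
# Keys the extended model would need to accept.
MECHANISM_FIELDS = {"mechanism", "mechanism-variant", "quality-eval-id",
                    "quality-eval-status"}

def unknown_keys(cfg: dict, allowed: set) -> list:
    """Return the keys a forbid-extra validator would reject."""
    return sorted(set(cfg) - allowed)

cfg = {"canonical-model-id": "deepseek_r1_0528", "mechanism": "baseline"}
rejected = unknown_keys(cfg, ISB1_CONFIG_FIELDS)                  # current model
accepted = unknown_keys(cfg, ISB1_CONFIG_FIELDS | MECHANISM_FIELDS)  # extended model
```

Until the Pydantic model is widened in this way, matrix generation fails on the first mechanism key it sees.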

Comment on lines +32 to +37
mechanism: kv_quantization
mechanism-variant: fp8_e4m3
compression-method: fp8_e4m3
compression-scope: kv_cache
quality-eval-id: ruler_v1
quality-eval-status: pending
Copilot AI Apr 17, 2026

This config introduces mechanism/quality-eval keys (mechanism, mechanism-variant, compression-*, quality-eval-*) that are currently rejected by the ISB1 config validator used by generate_sweep_configs.py isb1-sweep (it forbids unknown fields). As-is, the workflow’s setup step will error during matrix generation; to make this runnable, the ISB1 master-config + matrix-entry Pydantic models and the isb1-sweep generator need to explicitly accept and forward these fields.

offload-mode: ${{ matrix.config.offload-mode || '' }}
kv-cache-dtype: ${{ matrix.config.kv-cache-dtype || '' }}
disable-prefix-caching: ${{ matrix.config.disable-prefix-caching || '' }}
workload-type: ${{ matrix.config.workload-type || '' }}
Copilot AI Apr 17, 2026

The sweep job only forwards the standard ISB1 inputs into benchmark-isb1-tmpl.yml; none of the mechanism_eval fields from the config (e.g., mechanism, mechanism-variant, quality-eval-id/status, draft/speculative fields) are passed through as workflow inputs/env. As a result, process_result_isb1.py will see no MECHANISM*/QUALITY_* env vars and will default rows to mechanism="baseline", making the mechanism_eval wiring ineffective in real runs. To fix, extend the template workflow’s workflow_call inputs + env mapping and add corresponding with: entries here so these fields reach the runner environment.

Suggested change
workload-type: ${{ matrix.config.workload-type || '' }}
workload-type: ${{ matrix.config.workload-type || '' }}
mechanism: ${{ matrix.config.mechanism || '' }}
mechanism-variant: ${{ matrix.config.mechanism-variant || '' }}
quality-eval-id: ${{ matrix.config.quality-eval-id || '' }}
quality-eval-status: ${{ matrix.config.quality-eval-status || '' }}
draft-model: ${{ matrix.config.draft-model || '' }}
draft-model-prefix: ${{ matrix.config.draft-model-prefix || '' }}
speculative-model: ${{ matrix.config.speculative-model || '' }}
speculative-model-prefix: ${{ matrix.config.speculative-model-prefix || '' }}
