From 5c6b82f3687a40fd4a1f9ec44c19f8c8ec07cbb8 Mon Sep 17 00:00:00 2001
From: William Chen <57119977+OCWC22@users.noreply.github.com>
Date: Thu, 16 Apr 2026 23:23:29 -0700
Subject: [PATCH] =?UTF-8?q?feat(isb1):=20add=20mechanism=5Feval=20schema?=
 =?UTF-8?q?=20=E2=80=94=20mechanism/variant/quality=20registries=20+=20har?=
 =?UTF-8?q?d=20gate?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Extends the ISB1 replay result schema with a backward-compatible set of
optional fields so every row declares which optimization technique it
exercises (baseline, kv_quantization, kv_compression, compressed_attention,
speculative_decoding) and which quality benchmark backs any lossy-technique
claim. A hard gate then prevents a row from being labeled
support_status=supported for a lossy technique unless a registered quality
benchmark has completed. Follow-up to PR #1032.

All new fields default to NULL (mechanism defaults to "baseline") so
pre-existing rows, configs, and SQLite databases are unaffected until they
opt into the mechanism_eval vocabulary. The database migration is
idempotent; legacy schemas upgrade in place on first connect_db().

New files:
- utils/mechanism_eval.py
  Env-driven field catalog (14 fields), registry loaders, validation
  helpers, and the row_requires_completed_quality_eval predicate.
- datasets/isb1/registry/mechanism_variant_registry.json
  9 registered mechanism/variant pairs covering baseline, fp8_e4m3,
  turboquant_class, kvtc_class, triattention_class, mtp, eagle3, medusa,
  dflash.
- datasets/isb1/registry/quality_eval_registry.json
  4 registered quality benchmarks: ruler_v1, longbench_v2, humaneval,
  math_500.
- .github/configs/isb1-mechanism-baseline.yaml
  DSR1 (H100) and Qwen3.5 (B200) baseline cells.
- .github/configs/isb1-mechanism-fp8-kv.yaml
  Same two cells with FP8 E4M3 KV quantization, wired to ruler_v1 and held
  at reviewed_preview until the RULER run completes (the gate blocks
  promotion to supported without it).
- .github/workflows/run-isb1-mechanism-eval.yml
  Dispatch workflow routing mechanism configs through benchmark-isb1-tmpl.
- utils/test_mechanism_eval.py (13 tests).
- utils/test_process_result_isb1_mechanism.py (3 subprocess tests).

Extended files:
- utils/process_result_isb1.py — emits 14 mechanism fields + a
  mechanism_eval_validation record attached to every processed row.
- utils/gate_isb1.py — new mechanism_compression_quality gate enforcing:
  (1) any non-baseline mechanism_variant must resolve in the registry;
  (2) quality_eval_status in {pending, completed, failed, not_required};
  (3) supported + compression mechanism ⇒ quality_eval_status == completed
      with a registered quality_eval_id;
  (4) speculative_decoding ⇒ draft_model_id + speculative_acceptance_rate.
- datasets/isb1/scripts/isb1_results_db.py — 16 additive ALTER TABLE
  migrations plus matching SCHEMA_SQL, INSERT_COLUMNS, GROUPABLE_COLUMNS,
  and CLI ingest flags.
- utils/test_gate_isb1.py — 7 new mechanism-gate tests.

Full suite: 285 passed, 2 pre-existing warnings.

References — public literature the registries are grounded in:

KV cache quantization (mechanism: kv_quantization)
- fp8_e4m3: Micikevicius et al., "FP8 Formats for Deep Learning"
  (NVIDIA/Intel/Arm, 2022), arXiv:2209.05433. Defines the E4M3/E5M2
  formats used by engine-native FP8 KV paths in vLLM and SGLang.
- turboquant_class: umbrella slot for Hadamard-rotated 4-bit KV schemes;
  Hooper et al., "KVQuant", 2024, arXiv:2401.18079, is a representative
  reference. Specific implementation citations travel with each submitted
  row via mechanism_notes.

KV cache compression (mechanism: kv_compression)
- kvtc_class: umbrella slot for tensor-codebook / product-quantization KV
  compressors.
  The class label reflects the architecture pattern; each submitted row
  cites its specific implementation.

Compressed attention (mechanism: compressed_attention)
- triattention_class: umbrella slot for sparse-/hybrid-attention variants
  that change the attention-computation surface rather than the stored KV
  format.

Speculative decoding (mechanism: speculative_decoding)
- mtp: Multi-Token Prediction head as used at scale in DeepSeek-V3
  (DeepSeek-AI, 2024, arXiv:2412.19437).
- eagle3: EAGLE-family speculative decoding (Li et al., "EAGLE", 2024,
  arXiv:2401.15077; EAGLE-2 and EAGLE-3 are subsequent iterations of the
  same draft-model recipe).
- medusa: Cai et al., "Medusa: Simple LLM Inference Acceleration Framework
  with Multiple Decoding Heads", 2024, arXiv:2401.10774.
- dflash: umbrella slot for DeepFlash-style draft stacks.

Quality benchmarks (quality_eval_registry.json)
- ruler_v1: Hsieh et al., "RULER: What's the Real Context Size of Your
  Long-Context Language Models?" (NVIDIA, 2024), arXiv:2404.06654.
  Primary long-context retrieval signal for KV quantization and
  compression at 32K–1M.
- longbench_v2: Bai et al., "LongBench v2: Towards Deeper Understanding
  and Reasoning on Realistic Long-context Multitasks" (THUDM, 2024),
  arXiv:2412.15204. Complements RULER for reasoning-heavy long-context
  workloads.
- humaneval: Chen et al., "Evaluating Large Language Models Trained on
  Code" (OpenAI Codex paper, 2021), arXiv:2107.03374.
- math_500: 500-problem subset of the MATH dataset (Hendrycks et al.,
  "Measuring Mathematical Problem Solving With the MATH Dataset", 2021,
  arXiv:2103.03874). Detects chain-of-thought degradation from aggressive
  KV quantization — the specific failure mode the hard gate is designed to
  catch.
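The hard-gate trigger condition described above can be sketched in a few
lines. This is an illustrative sketch, not the shipped utils/mechanism_eval.py
(whose exact body is not reproduced here); it assumes the predicate keys only
off the mechanism class and the support tier:

```python
# Lossy mechanisms whose "supported" claims must be backed by a completed,
# registered quality eval (sketch; mirrors the gate description above).
COMPRESSION_MECHANISMS = {"kv_quantization", "kv_compression", "compressed_attention"}


def row_requires_completed_quality_eval(mechanism, support_status):
    """True iff this row may only carry support_status="supported" together
    with quality_eval_status="completed": supported tier x lossy mechanism."""
    return support_status == "supported" and mechanism in COMPRESSION_MECHANISMS
```

Under this sketch, baseline and speculative-decoding rows pass without a
quality eval, as do compression rows held at reviewed_preview.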
---
 .github/configs/isb1-mechanism-baseline.yaml  |  59 ++++
 .github/configs/isb1-mechanism-fp8-kv.yaml    |  76 +++++
 .github/workflows/run-isb1-mechanism-eval.yml | 120 ++++++++
 .../registry/mechanism_variant_registry.json  |  77 +++++
 .../isb1/registry/quality_eval_registry.json  |  42 +++
 datasets/isb1/scripts/isb1_results_db.py      | 136 ++++++++-
 utils/gate_isb1.py                            |  62 ++++
 utils/mechanism_eval.py                       | 217 ++++++++++++++
 utils/process_result_isb1.py                  |  32 ++
 utils/test_gate_isb1.py                       | 204 +++++++++++++
 utils/test_mechanism_eval.py                  | 280 ++++++++++++++++++
 utils/test_process_result_isb1_mechanism.py   | 210 +++++++++++++
 12 files changed, 1514 insertions(+), 1 deletion(-)
 create mode 100644 .github/configs/isb1-mechanism-baseline.yaml
 create mode 100644 .github/configs/isb1-mechanism-fp8-kv.yaml
 create mode 100644 .github/workflows/run-isb1-mechanism-eval.yml
 create mode 100644 datasets/isb1/registry/mechanism_variant_registry.json
 create mode 100644 datasets/isb1/registry/quality_eval_registry.json
 create mode 100644 utils/mechanism_eval.py
 create mode 100644 utils/test_mechanism_eval.py
 create mode 100644 utils/test_process_result_isb1_mechanism.py

diff --git a/.github/configs/isb1-mechanism-baseline.yaml b/.github/configs/isb1-mechanism-baseline.yaml
new file mode 100644
index 000000000..91387779a
--- /dev/null
+++ b/.github/configs/isb1-mechanism-baseline.yaml
@@ -0,0 +1,59 @@
+# ISB1 mechanism_eval — baseline (no compression, no speculative decoding).
+#
+# These rows anchor the mechanism-axis Pareto frontier: every other mechanism
+# variant is compared back to a baseline row with the same model × hardware ×
+# context band. Baseline rows declare mechanism=baseline / mechanism_variant=none
+# so they pass the mechanism_compression_quality gate without requiring a
+# quality_eval_id. The gate only enforces a completed quality eval for
+# supported-tier rows whose mechanism is in the compression set
+# (kv_quantization, kv_compression, compressed_attention).
+#
+# All cells here are benchmark_certification_status=dataset_replay_verified.
+# No live-serving certification is claimed.
+
+dsr1-fp8-h100-isb1-mechanism-baseline-vllm:
+  image: vllm/vllm-openai:v0.11.0
+  model: deepseek-ai/DeepSeek-R1-0528
+  model-prefix: dsr1
+  precision: fp8
+  framework: vllm
+  runner: h100
+  benchmark-type: isb1_replay
+  runtime-stack-id: standalone:vllm
+  hardware-profile-id: nvidia:h100_sxm_80gb
+  canonical-model-id: deepseek_r1_0528
+  mechanism: baseline
+  mechanism-variant: none
+  replay-configs:
+    - export-file: datasets/isb1/exports/core/code_8k1k.json
+      request-mode: multi-turn
+      support-status: supported
+      search-space:
+        - max-concurrency: 2
+          max-sessions: 2
+          max-turns-per-session: 3
+          num-warmup-sessions: 0
+
+qwen3.5-fp8-b200-isb1-mechanism-baseline-sglang:
+  image: lmsysorg/sglang:v0.5.9-cu130
+  model: Qwen/Qwen3.5-397B-A17B-FP8
+  model-prefix: qwen3.5
+  precision: fp8
+  framework: sglang
+  runner: b200
+  benchmark-type: isb1_replay
+  runtime-stack-id: standalone:sglang
+  hardware-profile-id: nvidia:b200_sxm_180gb
+  canonical-model-id: qwen3_5_397b_a17b
+  mechanism: baseline
+  mechanism-variant: none
+  max-model-len: 131072
+  replay-configs:
+    - export-file: datasets/isb1/exports/extension_131k/code_131k1k_qwen3.5.json
+      request-mode: multi-turn
+      support-status: reviewed_preview
+      search-space:
+        - max-concurrency: 2
+          max-sessions: 2
+          max-turns-per-session: 3
+          num-warmup-sessions: 0
diff --git a/.github/configs/isb1-mechanism-fp8-kv.yaml b/.github/configs/isb1-mechanism-fp8-kv.yaml
new file mode 100644
index 000000000..3de68734b
--- /dev/null
+++ b/.github/configs/isb1-mechanism-fp8-kv.yaml
@@ -0,0 +1,76 @@
+# ISB1 mechanism_eval — FP8 KV quantization.
+#
+# Exercises the engine-native FP8 KV cache path (vLLM --kv-cache-dtype fp8,
+# SGLang --kv-cache-dtype fp8_e4m3). Cells here ship with:
+#   mechanism: kv_quantization
+#   mechanism_variant: fp8_e4m3
+#   compression_method: fp8_e4m3
+#   compression_scope: kv_cache
+#   quality_eval_id: ruler_v1      ← registered harness
+#   quality_eval_status: pending   ← must become "completed" before
+#                                    support_status can move to "supported"
+#
+# Gate rule enforced by utils/gate_isb1.py mechanism_compression_quality:
+#   support_status == "supported" AND mechanism in compression set
+#     ⇒ quality_eval_status == "completed"
+#
+# Until the referenced RULER run lands, these rows stay at
+# support_status=reviewed_preview so the hard gate passes. Moving the row to
+# "supported" without filling the quality delta will fail the gate.
+
+dsr1-fp8-h100-isb1-mechanism-fp8-kv-vllm:
+  image: vllm/vllm-openai:v0.11.0
+  model: deepseek-ai/DeepSeek-R1-0528
+  model-prefix: dsr1
+  precision: fp8
+  framework: vllm
+  runner: h100
+  benchmark-type: isb1_replay
+  runtime-stack-id: standalone:vllm
+  hardware-profile-id: nvidia:h100_sxm_80gb
+  canonical-model-id: deepseek_r1_0528
+  mechanism: kv_quantization
+  mechanism-variant: fp8_e4m3
+  compression-method: fp8_e4m3
+  compression-scope: kv_cache
+  quality-eval-id: ruler_v1
+  quality-eval-status: pending
+  kv-cache-dtype: fp8
+  replay-configs:
+    - export-file: datasets/isb1/exports/extension_131k/code_131k1k.json
+      request-mode: multi-turn
+      support-status: reviewed_preview
+      search-space:
+        - max-concurrency: 2
+          max-sessions: 2
+          max-turns-per-session: 3
+          num-warmup-sessions: 0
+
+qwen3.5-fp8-b200-isb1-mechanism-fp8-kv-sglang:
+  image: lmsysorg/sglang:v0.5.9-cu130
+  model: Qwen/Qwen3.5-397B-A17B-FP8
+  model-prefix: qwen3.5
+  precision: fp8
+  framework: sglang
+  runner: b200
+  benchmark-type: isb1_replay
+  runtime-stack-id: standalone:sglang
+  hardware-profile-id: nvidia:b200_sxm_180gb
+  canonical-model-id: qwen3_5_397b_a17b
+  mechanism: kv_quantization
+  mechanism-variant: fp8_e4m3
+  compression-method: fp8_e4m3
+  compression-scope: kv_cache
+  quality-eval-id: ruler_v1
+  quality-eval-status: pending
+  kv-cache-dtype: fp8_e4m3
+  max-model-len: 131072
+  replay-configs:
+    - export-file: datasets/isb1/exports/extension_131k/code_131k1k_qwen3.5.json
+      request-mode: multi-turn
+      support-status: reviewed_preview
+      search-space:
+        - max-concurrency: 2
+          max-sessions: 2
+          max-turns-per-session: 3
+          num-warmup-sessions: 0
diff --git a/.github/workflows/run-isb1-mechanism-eval.yml b/.github/workflows/run-isb1-mechanism-eval.yml
new file mode 100644
index 000000000..cf069f291
--- /dev/null
+++ b/.github/workflows/run-isb1-mechanism-eval.yml
@@ -0,0 +1,120 @@
+name: Run ISB1 Mechanism Eval Sweep
+run-name: ISB1 Mechanism Eval - ${{ github.event.inputs.config-file || '.github/configs/isb1-mechanism-baseline.yaml' }}
+
+# Dispatches ISB1 replay rows with mechanism_eval metadata attached.
+# The config files declare mechanism/mechanism_variant/quality_eval_id etc.
+# at the top level; utils/matrix_logic/generate_sweep_configs.py plumbs them
+# through as env vars read by utils/process_result_isb1.py.
+#
+# gate_isb1.py runs its mechanism_compression_quality gate against the
+# aggregated result set: any supported-tier compression row without a
+# completed, registered quality eval fails the gate.
+
+on:
+  workflow_dispatch:
+    inputs:
+      config-file:
+        description: ISB1 mechanism_eval config file path
+        required: true
+        default: .github/configs/isb1-mechanism-baseline.yaml
+      runner-type:
+        description: Optional space-separated runner filters (e.g. h200 b200)
+        required: false
+        default: ''
+      runner-config:
+        description: Runner config YAML
+        required: false
+        default: .github/configs/runners.yaml
+      ref:
+        description: Git ref to checkout
+        required: false
+        default: ''
+
+jobs:
+  setup:
+    runs-on: ubuntu-latest
+    outputs:
+      mechanism-matrix: ${{ steps.generate.outputs.mechanism-matrix }}
+      has-matrix: ${{ steps.generate.outputs.has-matrix }}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        with:
+          token: ${{ secrets.REPO_PAT }}
+          fetch-depth: 0
+          ref: ${{ inputs.ref || github.ref }}
+
+      - name: Install dependencies
+        run: pip install pydantic pyyaml
+
+      - id: generate
+        env:
+          CONFIG_FILE: ${{ inputs.config-file }}
+          RUNNER_CONFIG: ${{ inputs.runner-config }}
+          RUNNER_TYPE: ${{ inputs.runner-type }}
+        run: |
+          if [ ! -f "$CONFIG_FILE" ]; then
+            echo "Missing ISB1 mechanism_eval config file: $CONFIG_FILE" >&2
+            exit 1
+          fi
+
+          cmd=(python3 utils/matrix_logic/generate_sweep_configs.py isb1-sweep --config-files "$CONFIG_FILE" --runner-config "$RUNNER_CONFIG")
+
+          if [ -n "$RUNNER_TYPE" ]; then
+            read -r -a runner_types <<< "$RUNNER_TYPE"
+            cmd+=(--runner-type "${runner_types[@]}")
+          fi
+
+          matrix_json="$("${cmd[@]}")"
+          compact_matrix="$(printf '%s' "$matrix_json" | python3 -c 'import json,sys; print(json.dumps(json.load(sys.stdin)))')"
+          has_matrix="$(printf '%s' "$compact_matrix" | python3 -c 'import json,sys; print("true" if json.load(sys.stdin) else "false")')"
+
+          {
+            echo "mechanism-matrix=$compact_matrix"
+            echo "has-matrix=$has_matrix"
+          } >> "$GITHUB_OUTPUT"
+
+  sweep:
+    needs: setup
+    if: ${{ needs.setup.outputs.has-matrix == 'true' }}
+    uses: ./.github/workflows/benchmark-isb1-tmpl.yml
+    strategy:
+      fail-fast: false
+      matrix:
+        config: ${{ fromJson(needs.setup.outputs.mechanism-matrix) }}
+    secrets: inherit
+    with:
+      runner: ${{ matrix.config.runner }}
+      image: ${{ matrix.config.image }}
+      model: ${{ matrix.config.model }}
+      model-prefix: ${{ matrix.config.model-prefix }}
+      precision: ${{ matrix.config.precision }}
+      framework: ${{ matrix.config.framework }}
+      exp-name: ${{ matrix.config.exp-name }}
+      benchmark-type: ${{ matrix.config.benchmark-type }}
+      export-file: ${{ matrix.config.export-file }}
+      runtime-stack-id: ${{ matrix.config.runtime-stack-id }}
+      hardware-profile-id: ${{ matrix.config.hardware-profile-id }}
+      canonical-model-id: ${{ matrix.config.canonical-model-id }}
+      support-status: ${{ matrix.config.support-status || '' }}
+      request-mode: ${{ matrix.config.request-mode }}
+      max-concurrency: ${{ matrix.config.max-concurrency }}
+      max-sessions: ${{ matrix.config.max-sessions || '' }}
+      max-turns-per-session: ${{ matrix.config.max-turns-per-session || '' }}
+      max-model-len: ${{ matrix.config.max-model-len || '' }}
+      tp-override: ${{ matrix.config.tp || '' }}
+      ep-override: ${{ matrix.config.ep || '' }}
+      trace-source: ${{ matrix.config.trace-source || '' }}
+      offload-mode: ${{ matrix.config.offload-mode || '' }}
+      kv-cache-dtype: ${{ matrix.config.kv-cache-dtype || '' }}
+      disable-prefix-caching: ${{ matrix.config.disable-prefix-caching || '' }}
+      workload-type: ${{ matrix.config.workload-type || '' }}
+      ref: ${{ inputs.ref || github.ref }}
+
+  collect-results:
+    needs: [setup, sweep]
+    if: ${{ always() && needs.setup.outputs.has-matrix == 'true' && needs.sweep.result != 'skipped' }}
+    uses: ./.github/workflows/collect-results.yml
+    secrets: inherit
+    with:
+      result-prefix: isb1-mechanism
diff --git a/datasets/isb1/registry/mechanism_variant_registry.json b/datasets/isb1/registry/mechanism_variant_registry.json
new file mode 100644
index 000000000..1cebb4daa
--- /dev/null
+++ b/datasets/isb1/registry/mechanism_variant_registry.json
@@ -0,0 +1,77 @@
+{
+  "schema_version": "1.0.0",
+  "description": "Registered mechanism × variant pairs for ISB1 mechanism_eval. Rows whose mechanism/mechanism_variant do not appear here are flagged unregistered by utils/mechanism_eval.py and must not be cited as certified.",
+  "compression_mechanisms": [
+    "kv_quantization",
+    "kv_compression",
+    "compressed_attention"
+  ],
+  "speculative_mechanisms": [
+    "speculative_decoding"
+  ],
+  "variants": [
+    {
+      "mechanism": "baseline",
+      "mechanism_variant": "none",
+      "compression_method": null,
+      "compression_scope": null,
+      "description": "No mechanism applied. Used as the reference point for all other mechanism rows."
+    },
+    {
+      "mechanism": "kv_quantization",
+      "mechanism_variant": "fp8_e4m3",
+      "compression_method": "fp8_e4m3",
+      "compression_scope": "kv_cache",
+      "description": "Per-tensor FP8 E4M3 KV cache quantization. Engine-native path (vLLM --kv-cache-dtype fp8, SGLang --kv-cache-dtype fp8_e4m3)."
+    },
+    {
+      "mechanism": "kv_quantization",
+      "mechanism_variant": "turboquant_class",
+      "compression_method": "turboquant_class",
+      "compression_scope": "kv_cache",
+      "description": "TurboQuant-class Hadamard-rotated 4-bit KV quantization. Requires non-null quality_eval_id to be cited at supported tier."
+    },
+    {
+      "mechanism": "kv_compression",
+      "mechanism_variant": "kvtc_class",
+      "compression_method": "kvtc_class",
+      "compression_scope": "kv_cache",
+      "description": "KVTC-class tensor-codebook KV compression. Requires non-null quality_eval_id to be cited at supported tier."
+    },
+    {
+      "mechanism": "compressed_attention",
+      "mechanism_variant": "triattention_class",
+      "compression_method": "triattention_class",
+      "compression_scope": "attention",
+      "description": "TriAttention-class sparse-attention variant. Requires non-null quality_eval_id to be cited at supported tier."
+    },
+    {
+      "mechanism": "speculative_decoding",
+      "mechanism_variant": "mtp",
+      "compression_method": null,
+      "compression_scope": null,
+      "description": "Multi-token prediction head as draft model. Requires draft_model_id and speculative_acceptance_rate."
+    },
+    {
+      "mechanism": "speculative_decoding",
+      "mechanism_variant": "eagle3",
+      "compression_method": null,
+      "compression_scope": null,
+      "description": "EAGLE-3 speculative decoding. Requires draft_model_id and speculative_acceptance_rate."
+    },
+    {
+      "mechanism": "speculative_decoding",
+      "mechanism_variant": "medusa",
+      "compression_method": null,
+      "compression_scope": null,
+      "description": "Medusa speculative decoding. Requires draft_model_id and speculative_acceptance_rate."
+    },
+    {
+      "mechanism": "speculative_decoding",
+      "mechanism_variant": "dflash",
+      "compression_method": null,
+      "compression_scope": null,
+      "description": "DeepFlash-style draft stack. Requires draft_model_id and speculative_acceptance_rate."
+    }
+  ]
+}
diff --git a/datasets/isb1/registry/quality_eval_registry.json b/datasets/isb1/registry/quality_eval_registry.json
new file mode 100644
index 000000000..bf907e4e5
--- /dev/null
+++ b/datasets/isb1/registry/quality_eval_registry.json
@@ -0,0 +1,42 @@
+{
+  "schema_version": "1.0.0",
+  "description": "Registered quality-eval harnesses for ISB1 mechanism_eval. A row asserting quality_eval_id must reference one of these; gate_isb1 requires a completed eval before any compression mechanism can claim support_status=supported.",
+  "eval_harnesses": [
+    {
+      "quality_eval_id": "ruler_v1",
+      "harness": "RULER",
+      "version": "v1.0",
+      "scope": "long_context_retrieval",
+      "metric_keys": ["ruler_avg_score", "ruler_per_length"],
+      "baseline_required": true,
+      "description": "Long-context retrieval benchmark. Primary signal for KV quantization and compression quality at 32k–1M."
+    },
+    {
+      "quality_eval_id": "longbench_v2",
+      "harness": "LongBench",
+      "version": "v2.0",
+      "scope": "long_context_reasoning",
+      "metric_keys": ["longbench_avg_f1", "longbench_per_task"],
+      "baseline_required": true,
+      "description": "Long-context reasoning and multi-doc QA. Complements RULER for reasoning-heavy workloads."
+    },
+    {
+      "quality_eval_id": "humaneval",
+      "harness": "HumanEval",
+      "version": "v1.0",
+      "scope": "code_generation",
+      "metric_keys": ["humaneval_pass_at_1", "humaneval_pass_at_10"],
+      "baseline_required": true,
+      "description": "Code-generation accuracy. Primary signal for coding workloads under compression."
+    },
+    {
+      "quality_eval_id": "math_500",
+      "harness": "MATH-500",
+      "version": "v1.0",
+      "scope": "reasoning_math",
+      "metric_keys": ["math_500_accuracy"],
+      "baseline_required": true,
+      "description": "Math reasoning accuracy. Detects chain-of-thought degradation from aggressive KV quantization."
+    }
+  ]
+}
diff --git a/datasets/isb1/scripts/isb1_results_db.py b/datasets/isb1/scripts/isb1_results_db.py
index e052fa766..6ab757d24 100644
--- a/datasets/isb1/scripts/isb1_results_db.py
+++ b/datasets/isb1/scripts/isb1_results_db.py
@@ -63,7 +63,23 @@
     cpu_cache_usage_peak_pct REAL,
     raw_result_json TEXT,
     status TEXT,
-    error_message TEXT
+    error_message TEXT,
+    mechanism TEXT,
+    mechanism_variant TEXT,
+    compression_method TEXT,
+    compression_scope TEXT,
+    compression_ratio REAL,
+    compression_overhead_ms REAL,
+    decompression_overhead_ms REAL,
+    quality_eval_id TEXT,
+    quality_eval_status TEXT,
+    quality_delta_summary TEXT,
+    draft_model_id TEXT,
+    speculative_acceptance_rate REAL,
+    speculative_wasted_tokens INTEGER,
+    mechanism_notes TEXT,
+    mechanism_eval_registered INTEGER,
+    quality_eval_registered INTEGER
 )
 """

@@ -113,6 +129,22 @@
     "raw_result_json",
     "status",
     "error_message",
+    "mechanism",
+    "mechanism_variant",
+    "compression_method",
+    "compression_scope",
+    "compression_ratio",
+    "compression_overhead_ms",
+    "decompression_overhead_ms",
+    "quality_eval_id",
+    "quality_eval_status",
+    "quality_delta_summary",
+    "draft_model_id",
+    "speculative_acceptance_rate",
+    "speculative_wasted_tokens",
+    "mechanism_notes",
+    "mechanism_eval_registered",
+    "quality_eval_registered",
 ]

 GROUPABLE_COLUMNS = {
@@ -128,6 +160,12 @@
     "offload_mode",
     "campaign_class",
     "trace_source",
+    "mechanism",
+    "mechanism_variant",
+    "compression_method",
+    "compression_scope",
+    "quality_eval_id",
+    "quality_eval_status",
 }

 DEFAULT_QUERY_COLUMNS = [
@@ -195,6 +233,22 @@ def parse_args() -> argparse.Namespace:
     ingest.add_argument("--gpu-profile-csv", help="Optional GPU profile CSV path to stash in raw_result_json metadata.")
     ingest.add_argument("--status", default="success", choices=["success", "failed", "timeout"])
     ingest.add_argument("--error-message")
+    # Mechanism_eval additive fields. All default to None so the ingest path
+    # remains backward compatible for rows that predate mechanism classification.
+    ingest.add_argument("--mechanism")
+    ingest.add_argument("--mechanism-variant")
+    ingest.add_argument("--compression-method")
+    ingest.add_argument("--compression-scope")
+    ingest.add_argument("--compression-ratio", type=float)
+    ingest.add_argument("--compression-overhead-ms", type=float)
+    ingest.add_argument("--decompression-overhead-ms", type=float)
+    ingest.add_argument("--quality-eval-id")
+    ingest.add_argument("--quality-eval-status", choices=["pending", "completed", "failed", "not_required"])
+    ingest.add_argument("--quality-delta-summary")
+    ingest.add_argument("--draft-model-id")
+    ingest.add_argument("--speculative-acceptance-rate", type=float)
+    ingest.add_argument("--speculative-wasted-tokens", type=int)
+    ingest.add_argument("--mechanism-notes")

     query = subparsers.add_parser("query", help="Print runs or an aggregated grouped view.")
     query.add_argument("--db-path", default=str(DEFAULT_DB_PATH), help="SQLite DB path.")
@@ -225,6 +279,24 @@ def parse_args() -> argparse.Namespace:
     f"ALTER TABLE {TABLE_NAME} ADD COLUMN workload_type TEXT",
     f"ALTER TABLE {TABLE_NAME} ADD COLUMN campaign_class TEXT",
     f"ALTER TABLE {TABLE_NAME} ADD COLUMN trace_source TEXT",
+    # Mechanism_eval schema (additive, backward-compatible). All columns default
+    # to NULL; rows that never set them retain their existing semantics.
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN mechanism TEXT",
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN mechanism_variant TEXT",
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN compression_method TEXT",
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN compression_scope TEXT",
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN compression_ratio REAL",
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN compression_overhead_ms REAL",
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN decompression_overhead_ms REAL",
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN quality_eval_id TEXT",
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN quality_eval_status TEXT",
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN quality_delta_summary TEXT",
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN draft_model_id TEXT",
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN speculative_acceptance_rate REAL",
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN speculative_wasted_tokens INTEGER",
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN mechanism_notes TEXT",
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN mechanism_eval_registered INTEGER",
+    f"ALTER TABLE {TABLE_NAME} ADD COLUMN quality_eval_registered INTEGER",
 ]

@@ -463,6 +535,68 @@ def insert_run(args: argparse.Namespace) -> None:
         "raw_result_json": json.dumps(build_raw_payload(payload, args), sort_keys=True),
         "status": args.status,
         "error_message": choose(args.error_message, payload.get("error_message")),
+        "mechanism": choose(getattr(args, "mechanism", None), payload.get("mechanism")),
+        "mechanism_variant": choose(
+            getattr(args, "mechanism_variant", None), payload.get("mechanism_variant")
+        ),
+        "compression_method": choose(
+            getattr(args, "compression_method", None), payload.get("compression_method")
+        ),
+        "compression_scope": choose(
+            getattr(args, "compression_scope", None), payload.get("compression_scope")
+        ),
+        "compression_ratio": to_float(
+            choose(getattr(args, "compression_ratio", None), payload.get("compression_ratio"))
+        ),
+        "compression_overhead_ms": to_float(
+            choose(
+                getattr(args, "compression_overhead_ms", None),
+                payload.get("compression_overhead_ms"),
+            )
+        ),
+        "decompression_overhead_ms": to_float(
+            choose(
+                getattr(args, "decompression_overhead_ms", None),
+                payload.get("decompression_overhead_ms"),
+            )
+        ),
+        "quality_eval_id": choose(
+            getattr(args, "quality_eval_id", None), payload.get("quality_eval_id")
+        ),
+        "quality_eval_status": choose(
+            getattr(args, "quality_eval_status", None), payload.get("quality_eval_status")
+        ),
+        "quality_delta_summary": choose(
+            getattr(args, "quality_delta_summary", None), payload.get("quality_delta_summary")
+        ),
+        "draft_model_id": choose(
+            getattr(args, "draft_model_id", None), payload.get("draft_model_id")
+        ),
+        "speculative_acceptance_rate": to_float(
+            choose(
+                getattr(args, "speculative_acceptance_rate", None),
+                payload.get("speculative_acceptance_rate"),
+            )
+        ),
+        "speculative_wasted_tokens": to_int(
+            choose(
+                getattr(args, "speculative_wasted_tokens", None),
+                payload.get("speculative_wasted_tokens"),
+            )
+        ),
+        "mechanism_notes": choose(
+            getattr(args, "mechanism_notes", None), payload.get("mechanism_notes")
+        ),
+        "mechanism_eval_registered": (
+            1 if (payload.get("mechanism_eval_validation") or {}).get("mechanism_eval_registered") is True
+            else 0 if (payload.get("mechanism_eval_validation") or {}).get("mechanism_eval_registered") is False
+            else None
+        ),
+        "quality_eval_registered": (
+            1 if (payload.get("mechanism_eval_validation") or {}).get("quality_eval_registered") is True
+            else 0 if (payload.get("mechanism_eval_validation") or {}).get("quality_eval_registered") is False
+            else None
+        ),
     }

     conn = connect_db(args.db_path)
diff --git a/utils/gate_isb1.py b/utils/gate_isb1.py
index e223e8c29..d63f0ac2a 100644
--- a/utils/gate_isb1.py
+++ b/utils/gate_isb1.py
@@ -3,6 +3,13 @@
 from pathlib import Path
 from typing import Any, Callable

+from mechanism_eval import (
+    COMPRESSION_MECHANISMS,
+    SPECULATIVE_MECHANISMS,
+    VALID_QUALITY_STATUSES,
+    row_requires_completed_quality_eval,
+)
+
 Row = dict[str, Any]
 Criterion = tuple[str, Callable[[Row], bool]]

@@ -48,6 +55,10 @@ def build_row_reference(row: Row, failed_criteria: list[str] | None = None) -> R
         "infmax_model_prefix": row.get("infmax_model_prefix"),
         "support_status": row.get("support_status"),
         "context_pressure_status": (row.get("context_pressure_signal") or {}).get("status"),
+        "mechanism": row.get("mechanism"),
+        "mechanism_variant": row.get("mechanism_variant"),
+        "quality_eval_id": row.get("quality_eval_id"),
+        "quality_eval_status": row.get("quality_eval_status"),
     }
     if failed_criteria:
         reference["failed_criteria"] = failed_criteria
@@ -77,6 +88,43 @@ def vllm_context_ok(row: Row) -> bool:
     return signal.get("status") == "ok" and not bool(row.get("context_pressure_suspicious"))

+def mechanism_variant_registered(row: Row) -> bool:
+    """Baseline rows always pass; every other mechanism must be in the registry."""
+    if row.get("mechanism") in (None, "baseline"):
+        return True
+    validation = row.get("mechanism_eval_validation") or {}
+    return validation.get("mechanism_eval_registered") is True
+
+
+def quality_eval_completed(row: Row) -> bool:
+    """Hard rule: supported tier × compression mechanism requires completed eval."""
+    mechanism = row.get("mechanism")
+    support_status = row.get("support_status")
+    if not row_requires_completed_quality_eval(mechanism, support_status):
+        return True
+    if row.get("quality_eval_status") != "completed":
+        return False
+    validation = row.get("mechanism_eval_validation") or {}
+    # Registered quality_eval_id is mandatory when a completed eval is cited.
+    return validation.get("quality_eval_registered") is True
+
+
+def speculative_fields_present(row: Row) -> bool:
+    """Speculative-decoding mechanisms must carry draft_model_id and acceptance rate."""
+    if row.get("mechanism") not in SPECULATIVE_MECHANISMS:
+        return True
+    if not row.get("draft_model_id"):
+        return False
+    return row.get("speculative_acceptance_rate") is not None
+
+
+def quality_status_in_allowed_set(row: Row) -> bool:
+    status = row.get("quality_eval_status")
+    if status is None:
+        return True
+    return status in VALID_QUALITY_STATUSES
+
+
 def get_present_coverage(rows: list[Row]) -> set[tuple[str, str]]:
     return {
         (normalize_hw_label(row.get("hw")), row.get("framework", ""))
@@ -211,6 +259,20 @@ def build_gate_report(rows: list[Row], advisory: bool = True) -> Row:
             expected_coverage=EXPECTED_1M_COVERAGE,
             exact_coverage=True,
         ),
+        evaluate_gate(
+            "mechanism_compression_quality",
+            "Mechanism compression quality (hard gate)",
+            [row for row in rows if row.get("mechanism") is not None],
+            [
+                ("mechanism_variant registered", mechanism_variant_registered),
+                ("quality_eval_status in accepted set", quality_status_in_allowed_set),
+                (
+                    "supported+compression ⇒ quality_eval_status == completed",
+                    quality_eval_completed,
+                ),
+                ("speculative_decoding requires draft fields", speculative_fields_present),
+            ],
+        ),
     ]

     statuses = {gate["status"] for gate in gates}
diff --git a/utils/mechanism_eval.py b/utils/mechanism_eval.py
new file mode 100644
index 000000000..48da12905
--- /dev/null
+++ b/utils/mechanism_eval.py
@@ -0,0 +1,217 @@
+"""ISB1 mechanism_eval schema: env-driven mechanism fields + registry validation.
+
+This module extends the ISB1 replay result schema with a backward-compatible set
+of optional fields that classify every row by the *mechanism* it exercises
+(baseline, KV quantization, KV compression, compressed attention, speculative
+decoding). It also loads the mechanism_variant and quality_eval registries and
+exposes helpers used by process_result_isb1.py and gate_isb1.py.
+
+The schema is strictly additive: every new field defaults to None so existing
+consumers are unaffected until they opt into the mechanism_eval vocabulary.
+
+Hard gate: any row claiming support_status == "supported" with a compression
+mechanism (kv_quantization, kv_compression, compressed_attention) must carry a
+registered quality_eval_id and quality_eval_status == "completed". gate_isb1.py
+enforces this.
+"""
+
+from __future__ import annotations
+
+import json
+import os
+from pathlib import Path
+from typing import Any, Optional
+
+
+# Ordered list of optional mechanism_eval fields surfaced on every processed
+# ISB1 row. Each field is driven by an environment variable of the same
+# (upper-cased) name and defaults to None when the variable is unset.
+MECHANISM_FIELDS: tuple[tuple[str, str, str], ...] = (
+    # (row_key, env_var, kind) — kind is one of str, float, int, bool
+    ("mechanism", "MECHANISM", "str"),
+    ("mechanism_variant", "MECHANISM_VARIANT", "str"),
+    ("compression_method", "COMPRESSION_METHOD", "str"),
+    ("compression_scope", "COMPRESSION_SCOPE", "str"),
+    ("compression_ratio", "COMPRESSION_RATIO", "float"),
+    ("compression_overhead_ms", "COMPRESSION_OVERHEAD_MS", "float"),
+    ("decompression_overhead_ms", "DECOMPRESSION_OVERHEAD_MS", "float"),
+    ("quality_eval_id", "QUALITY_EVAL_ID", "str"),
+    ("quality_eval_status", "QUALITY_EVAL_STATUS", "str"),
+    ("quality_delta_summary", "QUALITY_DELTA_SUMMARY", "str"),
+    ("draft_model_id", "DRAFT_MODEL_ID", "str"),
+    ("speculative_acceptance_rate", "SPECULATIVE_ACCEPTANCE_RATE", "float"),
+    ("speculative_wasted_tokens", "SPECULATIVE_WASTED_TOKENS", "int"),
+    ("mechanism_notes", "MECHANISM_NOTES", "str"),
+)
+
+# Default values when the mechanism env vars are absent. `mechanism` defaults
+# to "baseline" so unclassified rows are never silently treated as compressed.
+_DEFAULTS: dict[str, Any] = {"mechanism": "baseline"} + +COMPRESSION_MECHANISMS: frozenset[str] = frozenset( + {"kv_quantization", "kv_compression", "compressed_attention"} +) +SPECULATIVE_MECHANISMS: frozenset[str] = frozenset({"speculative_decoding"}) +VALID_QUALITY_STATUSES: frozenset[str] = frozenset( + {"pending", "completed", "failed", "not_required"} +) + +REPO_ROOT = Path(__file__).resolve().parents[1] +MECHANISM_REGISTRY_PATH = REPO_ROOT / "datasets/isb1/registry/mechanism_variant_registry.json" +QUALITY_REGISTRY_PATH = REPO_ROOT / "datasets/isb1/registry/quality_eval_registry.json" + + +def _coerce(value: Optional[str], kind: str) -> Any: + if value is None or value == "": + return None + try: + if kind == "float": + return float(value) + if kind == "int": + return int(float(value)) + if kind == "bool": + return value.lower() in {"1", "true", "yes", "on"} + except (TypeError, ValueError): + return None + return value + + +def build_mechanism_fields( + env: Optional[dict[str, str]] = None, +) -> dict[str, Any]: + """Return the mechanism_eval field dict derived from environment variables. + + Unset or blank environment variables yield None for every field except + `mechanism`, which defaults to "baseline" so rows are never silently + unclassified. 
+ """ + env = os.environ if env is None else env + result: dict[str, Any] = {} + for row_key, env_var, kind in MECHANISM_FIELDS: + raw = env.get(env_var) + coerced = _coerce(raw, kind) + if coerced is None: + coerced = _DEFAULTS.get(row_key) + result[row_key] = coerced + return result + + +def load_mechanism_registry(path: Optional[Path] = None) -> dict[str, Any]: + registry_path = path or MECHANISM_REGISTRY_PATH + payload = json.loads(registry_path.read_text()) + if not isinstance(payload, dict): + raise ValueError(f"Mechanism registry at {registry_path} is not a JSON object.") + return payload + + +def load_quality_registry(path: Optional[Path] = None) -> dict[str, Any]: + registry_path = path or QUALITY_REGISTRY_PATH + payload = json.loads(registry_path.read_text()) + if not isinstance(payload, dict): + raise ValueError(f"Quality registry at {registry_path} is not a JSON object.") + return payload + + +def registered_variant_keys(registry: dict[str, Any]) -> set[tuple[str, str]]: + keys: set[tuple[str, str]] = set() + for entry in registry.get("variants", []) or []: + mechanism = entry.get("mechanism") + variant = entry.get("mechanism_variant") + if mechanism and variant: + keys.add((mechanism, variant)) + return keys + + +def registered_quality_ids(registry: dict[str, Any]) -> set[str]: + return { + entry.get("quality_eval_id") + for entry in registry.get("eval_harnesses", []) or [] + if entry.get("quality_eval_id") + } + + +def validate_mechanism_fields( + fields: dict[str, Any], + *, + mechanism_registry: Optional[dict[str, Any]] = None, + quality_registry: Optional[dict[str, Any]] = None, +) -> dict[str, Any]: + """Return a validation record describing registration + coherence issues. + + The record never raises: it is additive metadata attached to the processed + row and consumed by gate_isb1.py. Unregistered mechanism/variant pairs + yield `mechanism_eval_registered=False`; an unregistered quality_eval_id + yields `quality_eval_registered=False`. 
+ """ + mechanism_registry = mechanism_registry or load_mechanism_registry() + quality_registry = quality_registry or load_quality_registry() + + mechanism = fields.get("mechanism") + variant = fields.get("mechanism_variant") or "none" + quality_eval_id = fields.get("quality_eval_id") + quality_eval_status = fields.get("quality_eval_status") + + variant_key = (mechanism, variant) + variant_registered = variant_key in registered_variant_keys(mechanism_registry) + + if quality_eval_id is None: + quality_registered: Optional[bool] = None + else: + quality_registered = quality_eval_id in registered_quality_ids(quality_registry) + + status_known = ( + quality_eval_status is None or quality_eval_status in VALID_QUALITY_STATUSES + ) + + issues: list[str] = [] + if not variant_registered and mechanism != "baseline": + issues.append( + f"mechanism/mechanism_variant pair ({mechanism!r}, {variant!r}) " + "is not registered in mechanism_variant_registry.json" + ) + if quality_eval_id is not None and quality_registered is False: + issues.append( + f"quality_eval_id={quality_eval_id!r} is not registered in quality_eval_registry.json" + ) + if not status_known: + issues.append( + f"quality_eval_status={quality_eval_status!r} is outside the accepted set " + f"{sorted(VALID_QUALITY_STATUSES)}" + ) + if mechanism in SPECULATIVE_MECHANISMS and not fields.get("draft_model_id"): + issues.append( + "speculative_decoding mechanism requires draft_model_id to be set" + ) + + return { + "mechanism_eval_registered": variant_registered, + "quality_eval_registered": quality_registered, + "quality_eval_status_known": status_known, + "issues": issues, + } + + +def row_requires_completed_quality_eval( + mechanism: Optional[str], support_status: Optional[str] +) -> bool: + """Hard rule: supported tier × compression mechanism ⇒ completed quality eval.""" + if support_status != "supported": + return False + return mechanism in COMPRESSION_MECHANISMS + + +__all__ = [ + "MECHANISM_FIELDS", + 
"COMPRESSION_MECHANISMS", + "SPECULATIVE_MECHANISMS", + "VALID_QUALITY_STATUSES", + "MECHANISM_REGISTRY_PATH", + "QUALITY_REGISTRY_PATH", + "build_mechanism_fields", + "load_mechanism_registry", + "load_quality_registry", + "registered_variant_keys", + "registered_quality_ids", + "validate_mechanism_fields", + "row_requires_completed_quality_eval", +] diff --git a/utils/process_result_isb1.py b/utils/process_result_isb1.py index 7f338ab2c..d4eb9dc27 100644 --- a/utils/process_result_isb1.py +++ b/utils/process_result_isb1.py @@ -5,6 +5,13 @@ from pathlib import Path from typing import Any, Optional, Tuple +from mechanism_eval import ( + build_mechanism_fields, + load_mechanism_registry, + load_quality_registry, + validate_mechanism_fields, +) + ISB1_RUNNABLE_CERTIFICATION_STATUSES = ["dataset_replay_verified"] @@ -373,6 +380,31 @@ def build_dispatch_ref() -> Optional[str]: "runtime_overrides": build_runtime_overrides(replay_result), } +# Mechanism_eval schema (additive, env-driven, backward-compatible null defaults). +# All fields default to None except `mechanism` which defaults to "baseline" so +# unclassified rows are never silently treated as compressed. See +# utils/mechanism_eval.py for the field catalog and +# datasets/isb1/registry/mechanism_variant_registry.json for the accepted values. +_mechanism_fields = build_mechanism_fields() +data.update(_mechanism_fields) +try: + _mechanism_registry = load_mechanism_registry() + _quality_registry = load_quality_registry() + data["mechanism_eval_validation"] = validate_mechanism_fields( + _mechanism_fields, + mechanism_registry=_mechanism_registry, + quality_registry=_quality_registry, + ) +except (FileNotFoundError, ValueError) as exc: + # Registry load failures degrade to advisory rather than breaking the run; + # gate_isb1 reports the issue downstream so it shows up in the gate report. 
+ data["mechanism_eval_validation"] = { + "mechanism_eval_registered": None, + "quality_eval_registered": None, + "quality_eval_status_known": None, + "issues": [f"registry_load_error: {exc}"], + } + effective_max_context_depth = data["max_model_len"] or (isl + osl + 200) data["effective_max_context_depth"] = effective_max_context_depth if effective_max_context_depth > 600000: diff --git a/utils/test_gate_isb1.py b/utils/test_gate_isb1.py index 3a9e590e0..3377edfe6 100644 --- a/utils/test_gate_isb1.py +++ b/utils/test_gate_isb1.py @@ -20,6 +20,14 @@ def make_row( total_sessions: int = 2, session_throughput_sps: float = 1.0, benchmark_certification_status: str = "dataset_replay_verified", + mechanism: str = "baseline", + mechanism_variant: str | None = "none", + quality_eval_id: str | None = None, + quality_eval_status: str | None = None, + draft_model_id: str | None = None, + speculative_acceptance_rate: float | None = None, + mechanism_eval_registered: bool | None = True, + quality_eval_registered: bool | None = None, ): return { "benchmark_type": "isb1_replay", @@ -45,6 +53,18 @@ def make_row( "total_sessions": total_sessions, "session_throughput_sps": session_throughput_sps, "benchmark_certification_status": benchmark_certification_status, + "mechanism": mechanism, + "mechanism_variant": mechanism_variant, + "quality_eval_id": quality_eval_id, + "quality_eval_status": quality_eval_status, + "draft_model_id": draft_model_id, + "speculative_acceptance_rate": speculative_acceptance_rate, + "mechanism_eval_validation": { + "mechanism_eval_registered": mechanism_eval_registered, + "quality_eval_registered": quality_eval_registered, + "quality_eval_status_known": True, + "issues": [], + }, } @@ -216,3 +236,187 @@ def test_gate_main_strict_returns_nonzero_on_failure(tmp_path): assert load_rows(report_path)[0]["result_filename"] == "dsr1_control_b200_vllm" assert main([str(report_path), "--strict"]) == 1 + + + +def test_mechanism_gate_passes_for_baseline_rows(): + rows 
= [ + make_row( + result_filename="baseline_b200_vllm", + model="dsr1", + hw="b200-cw-1", + framework="vllm", + support_status="supported", + effective_max_context_depth=9416, + context_pressure_class="standard", + context_status="not_applicable", + mechanism="baseline", + mechanism_variant="none", + ) + ] + report = build_gate_report(rows) + mechanism_gate = next( + gate for gate in report["gates"] if gate["id"] == "mechanism_compression_quality" + ) + # Baseline rows enter the mechanism filter but pass every criterion trivially + # — no compression mechanism, no speculative draft required, no quality eval required. + assert mechanism_gate["status"] == "pass" + assert mechanism_gate["matched_rows"] == 1 + assert mechanism_gate["failing_rows"] == [] + + +def test_mechanism_gate_fails_supported_fp8_without_completed_eval(): + rows = [ + make_row( + result_filename="dsr1_fp8kv_h100_vllm", + model="dsr1", + hw="h100-cw-1", + framework="vllm", + support_status="supported", + effective_max_context_depth=131272, + context_pressure_class="standard", + context_status="not_applicable", + mechanism="kv_quantization", + mechanism_variant="fp8_e4m3", + quality_eval_id="ruler_v1", + quality_eval_status="pending", + quality_eval_registered=True, + ) + ] + report = build_gate_report(rows) + mechanism_gate = next( + gate for gate in report["gates"] if gate["id"] == "mechanism_compression_quality" + ) + assert mechanism_gate["status"] == "fail" + assert mechanism_gate["failing_rows"] + failed = mechanism_gate["failing_rows"][0] + assert any( + "supported+compression" in criterion for criterion in failed["failed_criteria"] + ) + + +def test_mechanism_gate_passes_reviewed_preview_fp8_without_eval(): + rows = [ + make_row( + result_filename="qwen_fp8kv_b200_sglang", + model="qwen3.5", + hw="b200-cw-1", + framework="sglang", + support_status="reviewed_preview", + effective_max_context_depth=131272, + context_pressure_class="standard", + context_status="not_applicable", + 
mechanism="kv_quantization", + mechanism_variant="fp8_e4m3", + quality_eval_id=None, + quality_eval_status=None, + ) + ] + report = build_gate_report(rows) + mechanism_gate = next( + gate for gate in report["gates"] if gate["id"] == "mechanism_compression_quality" + ) + assert mechanism_gate["status"] == "pass" + + +def test_mechanism_gate_passes_supported_fp8_with_completed_registered_eval(): + rows = [ + make_row( + result_filename="dsr1_fp8kv_h100_vllm", + model="dsr1", + hw="h100-cw-1", + framework="vllm", + support_status="supported", + effective_max_context_depth=131272, + context_pressure_class="standard", + context_status="not_applicable", + mechanism="kv_quantization", + mechanism_variant="fp8_e4m3", + quality_eval_id="ruler_v1", + quality_eval_status="completed", + quality_eval_registered=True, + ) + ] + report = build_gate_report(rows) + mechanism_gate = next( + gate for gate in report["gates"] if gate["id"] == "mechanism_compression_quality" + ) + assert mechanism_gate["status"] == "pass" + + +def test_mechanism_gate_fails_unregistered_variant(): + rows = [ + make_row( + result_filename="weird_variant_b200_vllm", + model="qwen3.5", + hw="b200-cw-1", + framework="vllm", + support_status="reviewed_preview", + effective_max_context_depth=131272, + context_pressure_class="standard", + context_status="not_applicable", + mechanism="kv_quantization", + mechanism_variant="made_up_variant", + mechanism_eval_registered=False, + ) + ] + report = build_gate_report(rows) + mechanism_gate = next( + gate for gate in report["gates"] if gate["id"] == "mechanism_compression_quality" + ) + assert mechanism_gate["status"] == "fail" + failed = mechanism_gate["failing_rows"][0] + assert "mechanism_variant registered" in failed["failed_criteria"] + + +def test_mechanism_gate_fails_speculative_without_draft_model(): + rows = [ + make_row( + result_filename="spec_no_draft_h100_vllm", + model="dsr1", + hw="h100-cw-1", + framework="vllm", + support_status="reviewed_preview", + 
effective_max_context_depth=131272, + context_pressure_class="standard", + context_status="not_applicable", + mechanism="speculative_decoding", + mechanism_variant="eagle3", + draft_model_id=None, + speculative_acceptance_rate=None, + ) + ] + report = build_gate_report(rows) + mechanism_gate = next( + gate for gate in report["gates"] if gate["id"] == "mechanism_compression_quality" + ) + assert mechanism_gate["status"] == "fail" + failed = mechanism_gate["failing_rows"][0] + assert any( + "speculative_decoding requires draft fields" in criterion + for criterion in failed["failed_criteria"] + ) + + +def test_mechanism_gate_passes_speculative_with_full_fields(): + rows = [ + make_row( + result_filename="spec_h100_vllm", + model="dsr1", + hw="h100-cw-1", + framework="vllm", + support_status="reviewed_preview", + effective_max_context_depth=131272, + context_pressure_class="standard", + context_status="not_applicable", + mechanism="speculative_decoding", + mechanism_variant="eagle3", + draft_model_id="eagle3-draft-v1", + speculative_acceptance_rate=0.78, + ) + ] + report = build_gate_report(rows) + mechanism_gate = next( + gate for gate in report["gates"] if gate["id"] == "mechanism_compression_quality" + ) + assert mechanism_gate["status"] == "pass" diff --git a/utils/test_mechanism_eval.py b/utils/test_mechanism_eval.py new file mode 100644 index 000000000..731b6576d --- /dev/null +++ b/utils/test_mechanism_eval.py @@ -0,0 +1,280 @@ +"""Unit tests for the ISB1 mechanism_eval schema helpers.""" + +from __future__ import annotations + +import importlib +import json +import sqlite3 +import sys +from pathlib import Path + +import pytest + +UTILS_DIR = Path(__file__).resolve().parent +if str(UTILS_DIR) not in sys.path: + sys.path.insert(0, str(UTILS_DIR)) + +SCRIPTS_DIR = UTILS_DIR.parent / "datasets" / "isb1" / "scripts" +if str(SCRIPTS_DIR) not in sys.path: + sys.path.insert(0, str(SCRIPTS_DIR)) + +mechanism_eval = importlib.import_module("mechanism_eval") 
+isb1_results_db = importlib.import_module("isb1_results_db") + + +def test_build_mechanism_fields_defaults_to_baseline(monkeypatch): + for _, env_var, _ in mechanism_eval.MECHANISM_FIELDS: + monkeypatch.delenv(env_var, raising=False) + + fields = mechanism_eval.build_mechanism_fields() + + assert fields["mechanism"] == "baseline" + assert fields["mechanism_variant"] is None + for row_key in ( + "compression_method", + "compression_scope", + "compression_ratio", + "compression_overhead_ms", + "decompression_overhead_ms", + "quality_eval_id", + "quality_eval_status", + "quality_delta_summary", + "draft_model_id", + "speculative_acceptance_rate", + "speculative_wasted_tokens", + "mechanism_notes", + ): + assert fields[row_key] is None, f"expected {row_key} to default to None" + + +def test_build_mechanism_fields_coerces_numeric_env(monkeypatch): + monkeypatch.setenv("MECHANISM", "kv_quantization") + monkeypatch.setenv("MECHANISM_VARIANT", "fp8_e4m3") + monkeypatch.setenv("COMPRESSION_METHOD", "fp8_e4m3") + monkeypatch.setenv("COMPRESSION_SCOPE", "kv_cache") + monkeypatch.setenv("COMPRESSION_RATIO", "0.5") + monkeypatch.setenv("COMPRESSION_OVERHEAD_MS", "12.5") + monkeypatch.setenv("DECOMPRESSION_OVERHEAD_MS", "4.25") + monkeypatch.setenv("SPECULATIVE_WASTED_TOKENS", "128") + monkeypatch.setenv("SPECULATIVE_ACCEPTANCE_RATE", "0.82") + monkeypatch.setenv("QUALITY_EVAL_ID", "ruler_v1") + monkeypatch.setenv("QUALITY_EVAL_STATUS", "completed") + + fields = mechanism_eval.build_mechanism_fields() + + assert fields["mechanism"] == "kv_quantization" + assert fields["mechanism_variant"] == "fp8_e4m3" + assert fields["compression_ratio"] == pytest.approx(0.5) + assert fields["compression_overhead_ms"] == pytest.approx(12.5) + assert fields["decompression_overhead_ms"] == pytest.approx(4.25) + assert fields["speculative_wasted_tokens"] == 128 + assert fields["speculative_acceptance_rate"] == pytest.approx(0.82) + assert fields["quality_eval_id"] == "ruler_v1" + assert 
fields["quality_eval_status"] == "completed" + + +def test_build_mechanism_fields_rejects_bad_numeric(monkeypatch): + monkeypatch.setenv("COMPRESSION_RATIO", "not-a-number") + monkeypatch.setenv("SPECULATIVE_WASTED_TOKENS", "oops") + + fields = mechanism_eval.build_mechanism_fields() + + assert fields["compression_ratio"] is None + assert fields["speculative_wasted_tokens"] is None + + +def test_validate_mechanism_fields_registered_pair(): + fields = { + "mechanism": "kv_quantization", + "mechanism_variant": "fp8_e4m3", + "quality_eval_id": "ruler_v1", + "quality_eval_status": "completed", + } + record = mechanism_eval.validate_mechanism_fields(fields) + + assert record["mechanism_eval_registered"] is True + assert record["quality_eval_registered"] is True + assert record["quality_eval_status_known"] is True + assert record["issues"] == [] + + +def test_validate_mechanism_fields_unregistered_variant_flags_issue(): + fields = { + "mechanism": "kv_quantization", + "mechanism_variant": "made_up_variant", + "quality_eval_id": "ruler_v1", + "quality_eval_status": "completed", + } + record = mechanism_eval.validate_mechanism_fields(fields) + + assert record["mechanism_eval_registered"] is False + assert any( + "not registered in mechanism_variant_registry.json" in issue + for issue in record["issues"] + ) + + +def test_validate_mechanism_fields_unregistered_quality_eval_id(): + fields = { + "mechanism": "kv_quantization", + "mechanism_variant": "fp8_e4m3", + "quality_eval_id": "nonexistent_eval", + "quality_eval_status": "completed", + } + record = mechanism_eval.validate_mechanism_fields(fields) + + assert record["quality_eval_registered"] is False + assert any( + "not registered in quality_eval_registry.json" in issue + for issue in record["issues"] + ) + + +def test_validate_mechanism_fields_unknown_status(): + fields = { + "mechanism": "kv_quantization", + "mechanism_variant": "fp8_e4m3", + "quality_eval_id": "ruler_v1", + "quality_eval_status": "maybe", + } + record 
= mechanism_eval.validate_mechanism_fields(fields) + + assert record["quality_eval_status_known"] is False + assert any("outside the accepted set" in issue for issue in record["issues"]) + + +def test_validate_mechanism_fields_speculative_requires_draft_model(): + fields = { + "mechanism": "speculative_decoding", + "mechanism_variant": "eagle3", + "draft_model_id": None, + } + record = mechanism_eval.validate_mechanism_fields(fields) + + assert any("requires draft_model_id" in issue for issue in record["issues"]) + + +def test_validate_mechanism_fields_baseline_passes_without_quality(): + fields = {"mechanism": "baseline", "mechanism_variant": "none"} + record = mechanism_eval.validate_mechanism_fields(fields) + + assert record["mechanism_eval_registered"] is True + assert record["quality_eval_registered"] is None + assert record["issues"] == [] + + +def test_row_requires_completed_quality_eval_matrix(): + assert mechanism_eval.row_requires_completed_quality_eval( + "kv_quantization", "supported" + ) is True + assert mechanism_eval.row_requires_completed_quality_eval( + "kv_compression", "supported" + ) is True + assert mechanism_eval.row_requires_completed_quality_eval( + "compressed_attention", "supported" + ) is True + # Non-supported tier never requires completed eval. + assert mechanism_eval.row_requires_completed_quality_eval( + "kv_quantization", "reviewed_preview" + ) is False + # Baseline never requires. + assert mechanism_eval.row_requires_completed_quality_eval( + "baseline", "supported" + ) is False + # Speculative decoding is governed by a separate predicate, not this hard rule. 
+ assert mechanism_eval.row_requires_completed_quality_eval( + "speculative_decoding", "supported" + ) is False + + +def test_registry_files_are_valid_json_and_match_expectations(): + mechanism_registry = mechanism_eval.load_mechanism_registry() + quality_registry = mechanism_eval.load_quality_registry() + + # Every compression mechanism in the module matches a variant in the registry. + variants_by_mechanism: dict[str, set[str]] = {} + for entry in mechanism_registry["variants"]: + variants_by_mechanism.setdefault(entry["mechanism"], set()).add( + entry["mechanism_variant"] + ) + + for compression_mechanism in mechanism_eval.COMPRESSION_MECHANISMS: + assert compression_mechanism in variants_by_mechanism, ( + f"compression mechanism {compression_mechanism} is missing from the registry" + ) + + quality_ids = mechanism_eval.registered_quality_ids(quality_registry) + assert {"ruler_v1", "longbench_v2", "humaneval", "math_500"}.issubset(quality_ids) + + +def test_isb1_results_db_migration_is_idempotent(tmp_path): + db_path = tmp_path / "idempotent.db" + conn = isb1_results_db.connect_db(db_path) + # Second ensure_db call must not raise; the migration uses IF NOT EXISTS + # logic in the form of try/except OperationalError for ALTER TABLE. 
+ isb1_results_db.ensure_db(conn) + + cursor = conn.execute(f"PRAGMA table_info({isb1_results_db.TABLE_NAME})") + columns = {row[1] for row in cursor.fetchall()} + + expected = { + "mechanism", + "mechanism_variant", + "compression_method", + "compression_scope", + "compression_ratio", + "compression_overhead_ms", + "decompression_overhead_ms", + "quality_eval_id", + "quality_eval_status", + "quality_delta_summary", + "draft_model_id", + "speculative_acceptance_rate", + "speculative_wasted_tokens", + "mechanism_notes", + "mechanism_eval_registered", + "quality_eval_registered", + } + missing = expected - columns + assert not missing, f"expected mechanism columns missing after migration: {missing}" + + conn.close() + + +def test_isb1_results_db_migration_upgrades_legacy_schema(tmp_path): + """A pre-mechanism_eval database should gain the new columns on re-open.""" + db_path = tmp_path / "legacy.db" + + # Construct a legacy-shaped database by running only a minimal CREATE TABLE + # with the pre-mechanism_eval column set, then re-open via ensure_db. 
+ legacy_conn = sqlite3.connect(db_path) + legacy_conn.execute( + f""" + CREATE TABLE {isb1_results_db.TABLE_NAME} ( + id INTEGER PRIMARY KEY, + run_id TEXT, + timestamp TEXT, + gpu_type TEXT, + model TEXT, + engine TEXT, + context_band TEXT, + max_model_len INTEGER, + tp INTEGER, + raw_result_json TEXT, + status TEXT, + error_message TEXT + ) + """ + ) + legacy_conn.commit() + legacy_conn.close() + + conn = isb1_results_db.connect_db(db_path) + cursor = conn.execute(f"PRAGMA table_info({isb1_results_db.TABLE_NAME})") + columns = {row[1] for row in cursor.fetchall()} + + assert "mechanism" in columns + assert "quality_eval_id" in columns + assert "speculative_acceptance_rate" in columns + + conn.close() diff --git a/utils/test_process_result_isb1_mechanism.py b/utils/test_process_result_isb1_mechanism.py new file mode 100644 index 000000000..8a64c60a1 --- /dev/null +++ b/utils/test_process_result_isb1_mechanism.py @@ -0,0 +1,210 @@ +"""Integration test for mechanism_eval wiring in process_result_isb1.py. + +Runs process_result_isb1.py in a subprocess with a minimal replay fixture and +verifies that the aggregated JSON carries the mechanism_eval schema fields and +the mechanism_eval_validation record. 
+""" + +from __future__ import annotations + +import json +import os +import subprocess +import sys +from pathlib import Path + +UTILS_DIR = Path(__file__).resolve().parent +SCRIPT = UTILS_DIR / "process_result_isb1.py" + + +def _minimal_replay_payload(export_file: str) -> dict: + return { + "model_id": "deepseek-ai/DeepSeek-R1-0528", + "max_concurrency": 2, + "num_sessions": 2, + "max_turns": 3, + "num_warmup_sessions": 0, + "aggregate_metrics": { + "total_token_throughput_tps": 1000.0, + "output_throughput_tps": 800.0, + "total_sessions": 2, + "completed_sessions": 2, + "session_throughput_sps": 0.05, + "median_ttft_ms": 120.0, + "p99_ttft_ms": 240.0, + "median_tpot_ms": 15.0, + "p99_tpot_ms": 20.0, + "total_wall_time_s": 600.0, + }, + "per_turn_metrics": {}, + "server_metrics_summary": { + "gpu_cache_usage_peak": 0.25, + "cpu_cache_usage_peak": 0.0, + "cpu_cache_metric_available": True, + "kv_offload_observed": False, + "preemption_count": 0, + "observability_status": "ok", + }, + "selection": { + "support_statuses": ["reviewed_preview"], + "benchmark_certification_statuses": ["dataset_replay_verified"], + }, + "mode": "multi-turn", + "harness_request_mode": "auto", + "depth_telemetry": { + "total_actual_input_tokens": 12000, + "max_actual_context_len_per_turn": 10000, + }, + } + + +def _run_process_result( + tmp_path: Path, + export_file: Path, + replay_payload: dict, + extra_env: dict | None = None, +) -> dict: + result_filename = "mechanism_test" + replay_path = tmp_path / f"{result_filename}.json" + replay_path.write_text(json.dumps(replay_payload)) + + env = os.environ.copy() + # Strip any mechanism env vars from the outer environment so we control + # them explicitly per-case. 
+ for key in list(env): + if ( + key.startswith(("MECHANISM", "COMPRESSION_", "DECOMPRESSION_", + "QUALITY_", "DRAFT_MODEL_", "SPECULATIVE_")) + ): + env.pop(key, None) + + env.update( + { + "RUNNER_TYPE": "h100", + "FRAMEWORK": "vllm", + "PRECISION": "fp8", + "RESULT_FILENAME": result_filename, + "MODEL_PREFIX": "dsr1", + "IMAGE": "vllm/vllm-openai:v0.11.0", + "TP": "8", + "EP_SIZE": "1", + "DP_ATTENTION": "false", + "BENCHMARK_TYPE": "isb1_replay", + "EXPORT_FILE": str(export_file), + "RUNTIME_STACK_ID": "standalone:vllm", + "HARDWARE_PROFILE_ID": "nvidia:h100_sxm_80gb", + "CANONICAL_MODEL_ID": "deepseek_r1_0528", + "REQUEST_MODE": "multi-turn", + "MAX_CONCURRENCY": "2", + "SUPPORT_STATUS": "reviewed_preview", + } + ) + if extra_env: + env.update(extra_env) + + subprocess.run( + [sys.executable, str(SCRIPT)], + cwd=tmp_path, + env=env, + check=True, + stdout=subprocess.DEVNULL, + ) + + aggregated = json.loads((tmp_path / f"agg_{result_filename}.json").read_text()) + return aggregated + + +def test_process_result_defaults_to_baseline_when_no_env(tmp_path): + export_file = tmp_path / "code_8k1k.json" + export_file.write_text( + json.dumps( + { + "adapter_id": "test", + "bundle_id": "test_bundle", + "served_shape": {"isl": 8192, "osl": 1024}, + "exports": [{"context_band": "lc3_8k"}], + } + ) + ) + + aggregated = _run_process_result(tmp_path, export_file, _minimal_replay_payload(str(export_file))) + + assert aggregated["mechanism"] == "baseline" + assert aggregated["mechanism_variant"] is None + assert aggregated["compression_method"] is None + assert aggregated["quality_eval_id"] is None + assert "mechanism_eval_validation" in aggregated + # Baseline is always considered registered (even with variant=None). 
+ assert aggregated["mechanism_eval_validation"]["mechanism_eval_registered"] is True + + +def test_process_result_surfaces_registered_fp8_kv_fields(tmp_path): + export_file = tmp_path / "code_131k1k.json" + export_file.write_text( + json.dumps( + { + "adapter_id": "test", + "bundle_id": "test_bundle", + "served_shape": {"isl": 131072, "osl": 1024}, + "exports": [{"context_band": "xlc2_384k_512k"}], + } + ) + ) + + aggregated = _run_process_result( + tmp_path, + export_file, + _minimal_replay_payload(str(export_file)), + extra_env={ + "MECHANISM": "kv_quantization", + "MECHANISM_VARIANT": "fp8_e4m3", + "COMPRESSION_METHOD": "fp8_e4m3", + "COMPRESSION_SCOPE": "kv_cache", + "COMPRESSION_RATIO": "0.5", + "QUALITY_EVAL_ID": "ruler_v1", + "QUALITY_EVAL_STATUS": "pending", + }, + ) + + assert aggregated["mechanism"] == "kv_quantization" + assert aggregated["mechanism_variant"] == "fp8_e4m3" + assert aggregated["compression_method"] == "fp8_e4m3" + assert aggregated["compression_scope"] == "kv_cache" + assert aggregated["compression_ratio"] == 0.5 + assert aggregated["quality_eval_id"] == "ruler_v1" + assert aggregated["quality_eval_status"] == "pending" + validation = aggregated["mechanism_eval_validation"] + assert validation["mechanism_eval_registered"] is True + assert validation["quality_eval_registered"] is True + assert validation["quality_eval_status_known"] is True + assert validation["issues"] == [] + + +def test_process_result_flags_unregistered_variant(tmp_path): + export_file = tmp_path / "code_8k1k.json" + export_file.write_text( + json.dumps( + { + "adapter_id": "test", + "served_shape": {"isl": 8192, "osl": 1024}, + "exports": [{"context_band": "lc3_8k"}], + } + ) + ) + + aggregated = _run_process_result( + tmp_path, + export_file, + _minimal_replay_payload(str(export_file)), + extra_env={ + "MECHANISM": "kv_quantization", + "MECHANISM_VARIANT": "invented_variant", + }, + ) + + validation = aggregated["mechanism_eval_validation"] + assert 
validation["mechanism_eval_registered"] is False + assert any( + "not registered in mechanism_variant_registry.json" in issue + for issue in validation["issues"] + )
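---

Reviewer note: for anyone skimming the patch, the `mechanism_compression_quality` hard gate distills to a short standalone sketch. This restates the logic of `row_requires_completed_quality_eval` (utils/mechanism_eval.py) and `quality_eval_completed` (utils/gate_isb1.py) from the hunks above; the trimmed-down row dicts are illustrative only, not real processed rows.

```python
# Standalone sketch of the mechanism_compression_quality hard gate, restated
# from utils/mechanism_eval.py and utils/gate_isb1.py in this patch. Rows are
# plain dicts standing in for processed ISB1 rows.
COMPRESSION_MECHANISMS = frozenset(
    {"kv_quantization", "kv_compression", "compressed_attention"}
)


def row_requires_completed_quality_eval(mechanism, support_status):
    """Hard rule: supported tier x compression mechanism => completed eval."""
    if support_status != "supported":
        return False
    return mechanism in COMPRESSION_MECHANISMS


def quality_eval_completed(row):
    """Pass unless the hard rule applies and the cited eval is incomplete
    or its quality_eval_id is not registered."""
    if not row_requires_completed_quality_eval(
        row.get("mechanism"), row.get("support_status")
    ):
        return True
    if row.get("quality_eval_status") != "completed":
        return False
    validation = row.get("mechanism_eval_validation") or {}
    return validation.get("quality_eval_registered") is True


# A supported FP8-KV row with a pending RULER run is blocked ...
blocked = {
    "mechanism": "kv_quantization",
    "support_status": "supported",
    "quality_eval_status": "pending",
}
# ... while the same cell held at reviewed_preview passes without any eval,
# which is exactly why isb1-mechanism-fp8-kv.yaml ships at reviewed_preview.
preview = {"mechanism": "kv_quantization", "support_status": "reviewed_preview"}
```

This mirrors the promotion path in the commit message: the FP8 KV cells stay at `reviewed_preview` until the `ruler_v1` run completes and is registered, at which point `quality_eval_status == "completed"` flips the gate to pass.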