Add DSv4 FP8 H200 vLLM MTP benchmark #1222

Open
functionstackx wants to merge 5 commits into main from claude/add-dsv4-fp8-h200-vllm-mtp

Conversation

@functionstackx
Contributor

@functionstackx functionstackx commented Apr 29, 2026

Summary: ported the H200 recipe to MTP

  • New dsv4-fp8-h200-vllm-mtp config + benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh script.
  • MTP counterpart of dsv4-fp8-h200-vllm: identical launch flags (EP + DP=$TP, --gpu-memory-utilization 0.95, --max-num-seqs 512, --no-enable-flashinfer-autotune, FULL_DECODE_ONLY compile, max-model-len 800k) with one addition — --speculative-config '{"method":"mtp","num_speculative_tokens":2}'.
  • The recipe said H200 only supports spec token 1, but IIRC the Inferact folks said the H200 kernels now support spec decode with token 2, and based on @wzhao18's vLLM Blackwell MTP submission, token=2 appears to be the Pareto point (this obviously depends on the hardware SKU, so it isn't directly transferable).
  • Image: vllm/vllm-openai:v0.20.0-cu130 (canonical v0.20.0). The non-MTP dsv4-fp8-h200-vllm entry is unchanged and still on the deepseekv4-cu129 tag.
  • run_benchmark_serving invocation passes --dsv4 so prompts get chat-formatted encoding, per the AGENTS.md MTP rule (raw random tokens silently regress EAGLE acceptance).
  • Search space mirrors the non-MTP H200 entry: TP=8, EP=8, DP-attn=true, CONC 4-64, both 1k1k and 8k1k, with spec-decoding: mtp on each entry.
  • Adds a perf-changelog.yaml entry to trigger the new config.

Test plan

  • Trigger the dsv4-fp8-h200-vllm-mtp benchmark workflow on an H200 runner and confirm the engine starts and the sweep completes for at least one cell from each of the two seq-len-configs.
  • Confirm vllm/vllm-openai:v0.20.0-cu130 pulls cleanly.
  • server.log shows --speculative-config '{"method":"mtp","num_speculative_tokens":2}' and the rest of the H200 recipe flags.
  • Acceptance rate is in a sane range — --dsv4 is wired into run_benchmark_serving so the prompts go through the chat template.

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@@ -0,0 +1,99 @@
#!/usr/bin/env bash
🔴 The MTP benchmark script is added at benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh, but all three H200 launch scripts (runners/launch_h200-cw.sh:47, runners/launch_h200-nb.sh:22, runners/launch_h200-dgxc-slurm.sh:295) build the script path as benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh where FRAMEWORK_SUFFIX is empty for vllm — so they will look for dsv4_fp8_h200_mtp.sh and fail with 'No such file or directory' on every cell of the sweep. Unlike launch_b300-nv.sh, the H200 launchers have no framework-tagged-name fallback. Fix by either renaming the script to dsv4_fp8_h200_mtp.sh (matches the existing convention — see qwen3.5_fp8_h200_mtp.sh) or porting the B300 fallback logic to the H200 launchers.

Extended reasoning...

What the bug is

The PR adds a new vLLM MTP benchmark script at benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh and a corresponding dsv4-fp8-h200-vllm-mtp config in .github/configs/nvidia-master.yaml. However, the filename does not match what the H200 launch scripts will look for at runtime, so the workflow will hard-fail before vLLM ever starts.

How the launcher resolves the script path

All three H200 launch scripts build the benchmark script path the same way:

# runners/launch_h200-cw.sh:7-8, 47
MODEL_CODE="${EXP_NAME%%_*}"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
...
bash benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh

runners/launch_h200-nb.sh:7-8,22 is identical, and runners/launch_h200-dgxc-slurm.sh:295 inlines the same construction.

FRAMEWORK_SUFFIX is _trt only when the framework is trt; for vllm (and sglang) it is empty. SPEC_SUFFIX is _mtp when SPEC_DECODING=mtp.

Step-by-step proof for the new config

For the new dsv4-fp8-h200-vllm-mtp entry:

variable         value
model-prefix     dsv4
MODEL_CODE       dsv4 (from EXP_NAME="${model_code}_${seq_len_str}")
PRECISION        fp8
FRAMEWORK        vllm → FRAMEWORK_SUFFIX=""
SPEC_DECODING    mtp → SPEC_SUFFIX="_mtp"

So the resolved path is:

benchmarks/single_node/dsv4_fp8_h200_mtp.sh

But the PR added the file at:

benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh

bash will exit with No such file or directory, the runner will mark the cell as failed, and every cell of the new sweep (TP=8/EP=8, conc 4–64, both 1k1k and 8k1k) will fail before the engine starts.
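The derivation above can be replayed as a small runnable sketch. The variable names and suffix logic are taken from the launcher snippet quoted earlier; the seq-len half of EXP_NAME is an arbitrary example value.

```shell
#!/usr/bin/env bash
# Replay of the H200 launchers' path construction for the new config.
EXP_NAME="dsv4_1k1k"
PRECISION="fp8"
FRAMEWORK="vllm"
SPEC_DECODING="mtp"

MODEL_CODE="${EXP_NAME%%_*}"                                            # "dsv4"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')

resolved="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"
echo "$resolved"   # the un-tagged name, not the _vllm_mtp file this PR added
```

Running it prints `benchmarks/single_node/dsv4_fp8_h200_mtp.sh`, confirming the mismatch with the committed filename.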

Why the existing code does not save it

Unlike runners/launch_b300-nv.sh:267-272, which prefers a framework-tagged name and falls back to the legacy un-tagged name (this is exactly why dsv4_fp4_b300_vllm_mtp.sh and dsv4_fp4_b300_sglang_mtp.sh work on B300), the H200 launchers have no fallback — they construct one path and run it.

The existing H200 file naming convention confirms the expected name: every other vLLM/SGLang H200 MTP/non-MTP script in the tree omits the framework name (qwen3.5_fp8_h200_mtp.sh, dsr1_fp8_h200.sh, glm5_fp8_h200.sh, dsv4_fp8_h200.sh from this same series), and the only framework-tagged H200 scripts use _trt (dsr1_fp8_h200_trt_mtp.sh). The non-MTP counterpart in this PR's series — dsv4_fp8_h200.sh — already follows the no-suffix convention and works, which is itself evidence of the bug.

Impact and fix

This is a hard, deterministic PR-blocker: every cell of the new benchmark sweep fails to launch. Two fixes:

  1. Simplest: rename benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh → benchmarks/single_node/dsv4_fp8_h200_mtp.sh to match the existing H200 convention.
  2. Or: port the B300-style framework-tagged-then-legacy-fallback logic to all three H200 launch scripts so framework-tagged filenames also work.
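A rough sketch of option 2 (not the actual launch_b300-nv.sh code — the framework-tagged filename shape is an assumption): prefer a framework-tagged name, fall back to the legacy un-tagged one.

```shell
#!/usr/bin/env bash
# Hypothetical B300-style fallback, ported to the H200 path construction.
MODEL_CODE="dsv4"; PRECISION="fp8"; FRAMEWORK="vllm"; SPEC_SUFFIX="_mtp"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')

tagged="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200_${FRAMEWORK}${SPEC_SUFFIX}.sh"
legacy="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"

# Prefer the framework-tagged script if it exists; otherwise use the legacy name.
if [[ -f "$tagged" ]]; then
  script="$tagged"
else
  script="$legacy"
fi
echo "would run: $script"
```

With this in place, either filename works, so the PR's `_vllm_mtp` name would resolve without renaming.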

functionstackx and others added 4 commits April 29, 2026 00:47
Mirror of dsv4-fp8-h200-vllm + --speculative-config
'{"method":"mtp","num_speculative_tokens":2}', so we get an MTP
counterpart of the existing H200 vLLM DeepSeek-V4-Pro recipe at
https://vllm.ai/blog/deepseek-v4.

- Image: vllm/vllm-openai:v0.20.0-cu130 (canonical v0.20.0; the
  non-MTP entry is still on the deepseekv4-cu129 tag).
- Launch flags otherwise identical to dsv4_fp8_h200.sh: EP + DP=$TP,
  --gpu-memory-utilization 0.95, --max-num-seqs 512,
  --no-enable-flashinfer-autotune, FULL_DECODE_ONLY compile.
- run_benchmark_serving uses --dsv4 per the AGENTS.md MTP rule —
  EAGLE-style spec decoding regresses acceptance on raw random tokens.
- Search space mirrors the non-MTP H200 entry (TP=8, EP=8, DP-attn,
  CONC 4-64, both 1k1k and 8k1k) with spec-decoding: mtp.

Adds a perf-changelog entry to trigger the new config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The H200 runner (runners/launch_h200-cw.sh) constructs the script name
as ${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh
where FRAMEWORK_SUFFIX is empty for vllm — so it expects
benchmarks/single_node/dsv4_fp8_h200_mtp.sh, not the framework-named
dsv4_fp8_h200_vllm_mtp.sh.

Run 12597 failed with "No such file or directory"; rename to fix it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…um_speculative_tokens=1

- Export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 before vllm serve.
  The estimator overshoots H200 + MTP at memory-profile time and pushes
  us over budget even though actual cudagraph capture works fine.
- Drop num_speculative_tokens from 2 to 1 for now; bring it back up
  once we have a stable baseline on this image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the claude/add-dsv4-fp8-h200-vllm-mtp branch from 3f3052c to 4e6f92e Compare April 29, 2026 04:47
…dcoded 800k

Take the max-model-len from the runner-supplied MAX_MODEL_LEN env var
(added to check_env_vars) so the value is set centrally per config
instead of pinned in the script. Eval-only path is unchanged.
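A minimal sketch of the env-driven max-model-len this commit describes. The check_env_vars name comes from the commit message; this inline version of it is an assumption about what the helper does.

```shell
#!/usr/bin/env bash
# Fail fast if a required runner-supplied env var is missing (hypothetical
# stand-in for the repo's check_env_vars helper).
check_env_vars() {
  local v
  for v in "$@"; do
    [[ -n "${!v:-}" ]] || { echo "missing required env var: $v" >&2; return 1; }
  done
}

export MAX_MODEL_LEN="${MAX_MODEL_LEN:-800000}"   # the runner would export this per config
check_env_vars MAX_MODEL_LEN
echo "--max-model-len ${MAX_MODEL_LEN}"
```

The script then passes `--max-model-len "${MAX_MODEL_LEN}"` to the serve command instead of a hardcoded value, so each config can set it centrally.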

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>