Add DSv4 FP8 H200 vLLM MTP benchmark#1222
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```diff
@@ -0,0 +1,99 @@
+#!/usr/bin/env bash
```
🔴 The MTP benchmark script is added at benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh, but all three H200 launch scripts (runners/launch_h200-cw.sh:47, runners/launch_h200-nb.sh:22, runners/launch_h200-dgxc-slurm.sh:295) build the script path as benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh where FRAMEWORK_SUFFIX is empty for vllm — so they will look for dsv4_fp8_h200_mtp.sh and fail with 'No such file or directory' on every cell of the sweep. Unlike launch_b300-nv.sh, the H200 launchers have no framework-tagged-name fallback. Fix by either renaming the script to dsv4_fp8_h200_mtp.sh (matches the existing convention — see qwen3.5_fp8_h200_mtp.sh) or porting the B300 fallback logic to the H200 launchers.
Extended reasoning...
What the bug is
The PR adds a new vLLM MTP benchmark script at benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh and a corresponding dsv4-fp8-h200-vllm-mtp config in .github/configs/nvidia-master.yaml. However, the filename does not match what the H200 launch scripts will look for at runtime, so the workflow will hard-fail before vLLM ever starts.
How the launcher resolves the script path
All three H200 launch scripts build the benchmark script path the same way:
```bash
# runners/launch_h200-cw.sh:7-8, 47
MODEL_CODE="${EXP_NAME%%_*}"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
...
bash benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh
```

runners/launch_h200-nb.sh:7-8,22 is identical, and runners/launch_h200-dgxc-slurm.sh:295 inlines the same construction.
FRAMEWORK_SUFFIX is _trt only when the framework is trt; for vllm (and sglang) it is empty. SPEC_SUFFIX is _mtp when SPEC_DECODING=mtp.
Step-by-step proof for the new config
For the new dsv4-fp8-h200-vllm-mtp entry:
| variable | value |
|---|---|
| model-prefix | `dsv4` |
| `MODEL_CODE` | `dsv4` (from `EXP_NAME="${model_code}_${seq_len_str}"`) |
| `PRECISION` | `fp8` |
| `FRAMEWORK` | `vllm` → `FRAMEWORK_SUFFIX=""` |
| `SPEC_DECODING` | `mtp` → `SPEC_SUFFIX="_mtp"` |
So the resolved path is:

`benchmarks/single_node/dsv4_fp8_h200_mtp.sh`

But the PR added the file at:

`benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh`
bash will exit with No such file or directory, the runner will mark the cell as failed, and every cell of the new sweep (TP=8/EP=8, conc 4–64, both 1k1k and 8k1k) will fail before the engine starts.
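The resolution can be reproduced in isolation. This sketch copies the launcher's construction logic from above; the `EXP_NAME` value is a hypothetical example of the `${model_code}_${seq_len_str}` shape, not a value taken from the config.

```shell
#!/usr/bin/env bash
# Reproduce the H200 launcher's path construction with this config's values.
# EXP_NAME is a hypothetical example of the "${model_code}_${seq_len_str}" shape.
EXP_NAME="dsv4_1k1k"
PRECISION="fp8"
FRAMEWORK="vllm"
SPEC_DECODING="mtp"

MODEL_CODE="${EXP_NAME%%_*}"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')

echo "benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"
# → benchmarks/single_node/dsv4_fp8_h200_mtp.sh
```

With `FRAMEWORK=vllm` the suffix is empty, so the framework-tagged filename is never produced.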
Why the existing code does not save it
Unlike runners/launch_b300-nv.sh:267-272, which prefers a framework-tagged name and falls back to the legacy un-tagged name (this is exactly why dsv4_fp4_b300_vllm_mtp.sh and dsv4_fp4_b300_sglang_mtp.sh work on B300), the H200 launchers have no fallback — they construct one path and run it.
The existing H200 file naming convention confirms the expected name: every other vLLM/SGLang H200 MTP/non-MTP script in the tree omits the framework name (qwen3.5_fp8_h200_mtp.sh, dsr1_fp8_h200.sh, glm5_fp8_h200.sh, dsv4_fp8_h200.sh from this same series), and the only framework-tagged H200 scripts use _trt (dsr1_fp8_h200_trt_mtp.sh). The non-MTP counterpart in this PR's series — dsv4_fp8_h200.sh — already follows the no-suffix convention and works, which is itself evidence of the bug.
Impact and fix
This is a hard, deterministic PR-blocker: every cell of the new benchmark sweep fails to launch. Two fixes:
- Simplest: rename `benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh` → `benchmarks/single_node/dsv4_fp8_h200_mtp.sh` to match the existing H200 convention.
- Or: port the B300-style framework-tagged-then-legacy-fallback logic to all three H200 launch scripts so framework-tagged filenames also work.
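A minimal sketch of what the second option could look like. The function name, argument order, and exact path layout here are illustrative assumptions, not the actual launch_b300-nv.sh code; only the prefer-tagged-then-fall-back behavior is taken from the description above.

```shell
#!/usr/bin/env bash
# Illustrative only: prefer a framework-tagged script name, then fall back to
# the legacy un-tagged name, mirroring the behavior described for
# launch_b300-nv.sh. Names and signatures are assumptions.
resolve_benchmark_script() {
  local model_code="$1" precision="$2" framework="$3" spec_suffix="$4"
  local tagged="benchmarks/single_node/${model_code}_${precision}_h200_${framework}${spec_suffix}.sh"
  local fw_suffix=""
  if [[ "$framework" == "trt" ]]; then fw_suffix="_trt"; fi
  local legacy="benchmarks/single_node/${model_code}_${precision}_h200${fw_suffix}${spec_suffix}.sh"
  if [[ -f "$tagged" ]]; then
    echo "$tagged"
  else
    echo "$legacy"
  fi
}

# Usage: bash "$(resolve_benchmark_script dsv4 fp8 vllm _mtp)"
```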
- Mirror of dsv4-fp8-h200-vllm + --speculative-config '{"method":"mtp","num_speculative_tokens":2}', so we get an MTP counterpart of the existing H200 vLLM DeepSeek-V4-Pro recipe at https://vllm.ai/blog/deepseek-v4.
- Image: vllm/vllm-openai:v0.20.0-cu130 (canonical v0.20.0; the
non-MTP entry is still on the deepseekv4-cu129 tag).
- Launch flags otherwise identical to dsv4_fp8_h200.sh: EP + DP=$TP,
--gpu-memory-utilization 0.95, --max-num-seqs 512,
--no-enable-flashinfer-autotune, FULL_DECODE_ONLY compile.
- run_benchmark_serving uses --dsv4 per the AGENTS.md MTP rule —
EAGLE-style spec decoding regresses acceptance on raw random tokens.
- Search space mirrors the non-MTP H200 entry (TP=8, EP=8, DP-attn,
CONC 4-64, both 1k1k and 8k1k) with spec-decoding: mtp.
Adds a perf-changelog entry to trigger the new config.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
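As a standalone sanity check, the literal handed to `--speculative-config` in the commit above can be verified as well-formed JSON; this snippet is not part of the recipe, just a quick shape check.

```shell
# Verify the --speculative-config literal parses as JSON with the expected keys.
SPEC_CONFIG='{"method":"mtp","num_speculative_tokens":2}'
python3 -c 'import json,sys; c=json.loads(sys.argv[1]); print(c["method"], c["num_speculative_tokens"])' "$SPEC_CONFIG"
# → mtp 2
```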
The H200 runner (runners/launch_h200-cw.sh) constructs the script name
as ${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh
where FRAMEWORK_SUFFIX is empty for vllm — so it expects
benchmarks/single_node/dsv4_fp8_h200_mtp.sh, not the framework-named
dsv4_fp8_h200_vllm_mtp.sh.
Run 12597 failed with "No such file or directory"; rename to fix it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…um_speculative_tokens=1

- Export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 before vllm serve. The estimator overshoots on H200 + MTP at memory-profile time and pushes us over budget even though actual cudagraph capture works fine.
- Drop num_speculative_tokens from 2 to 1 for now; bring it back up once we have a stable baseline on this image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
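The commit's two changes could land in the benchmark script roughly as follows. The env var name is quoted from the commit message, not invented here; the serve line is elided since the rest of the recipe is unchanged.

```shell
# Sketch of the commit's two changes (env var name from the commit message):
# disable the cudagraph memory estimator and run with one speculative token.
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0
SPEC_CONFIG='{"method":"mtp","num_speculative_tokens":1}'
# vllm serve ... --speculative-config "$SPEC_CONFIG"   # rest of recipe unchanged
```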
Force-pushed 3f3052c to 4e6f92e
…dcoded 800k

Take the max-model-len from the runner-supplied MAX_MODEL_LEN env var (added to check_env_vars) so the value is set centrally per config instead of pinned in the script. Eval-only path is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
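A hedged sketch of centralizing the value as the commit describes. The `check_env_vars` name comes from the commit message; its body here is a stand-in for the repo's real helper, and the 800000 default exists only so the example runs.

```shell
#!/usr/bin/env bash
# Sketch: fail fast if the runner did not supply MAX_MODEL_LEN, then build the
# flag from it. check_env_vars is a stand-in for the repo's helper of the same name.
check_env_vars() {
  local var
  for var in "$@"; do
    if [[ -z "${!var:-}" ]]; then
      echo "missing required env var: $var" >&2
      return 1
    fi
  done
}

# The runner normally exports MAX_MODEL_LEN per config; 800000 is illustrative.
MAX_MODEL_LEN="${MAX_MODEL_LEN:-800000}"
check_env_vars MAX_MODEL_LEN
MAX_MODEL_LEN_FLAG="--max-model-len ${MAX_MODEL_LEN}"
```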
Summary

Ported the H200 recipe to MTP.
- New `dsv4-fp8-h200-vllm-mtp` config + `benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh` script.
- Mirrors `dsv4-fp8-h200-vllm`: identical launch flags (EP + DP=`$TP`, `--gpu-memory-utilization 0.95`, `--max-num-seqs 512`, `--no-enable-flashinfer-autotune`, FULL_DECODE_ONLY compile, max-model-len 800k) with one addition: `--speculative-config '{"method":"mtp","num_speculative_tokens":2}'`.
- Image: `vllm/vllm-openai:v0.20.0-cu130` (canonical v0.20.0). The non-MTP `dsv4-fp8-h200-vllm` entry is unchanged and still on the `deepseekv4-cu129` tag.
- The `run_benchmark_serving` invocation passes `--dsv4` so prompts get chat-formatted encoding, per the AGENTS.md MTP rule (raw random tokens silently regress EAGLE acceptance).
- `spec-decoding: mtp` on each entry.
- `perf-changelog.yaml` entry to trigger the new config.

Test plan
- Run the `dsv4-fp8-h200-vllm-mtp` benchmark workflow on an H200 runner and confirm the engine starts and the sweep completes for at least one cell from each of the two `seq-len-configs`.
- `vllm/vllm-openai:v0.20.0-cu130` pulls cleanly.
- `server.log` shows `--speculative-config '{"method":"mtp","num_speculative_tokens":2}'` and the rest of the H200 recipe flags.
- `--dsv4` is wired into `run_benchmark_serving` so the prompts go through the chat template.

🤖 Generated with Claude Code
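The `server.log` check in the test plan can be scripted. This is a sketch only: the file name comes from the test plan, the markers from the recipe's `--speculative-config`, and the helper name is invented here.

```shell
#!/usr/bin/env bash
# Sketch of the test-plan check: did the MTP flags make it into the server log?
# check_mtp_flags is a hypothetical helper name, not part of the repo.
check_mtp_flags() {
  # Returns 0 iff both MTP markers appear in the given log file.
  grep -qF '"method":"mtp"' "$1" && grep -qF '"num_speculative_tokens":2' "$1"
}

# Usage: check_mtp_flags server.log && echo "MTP flags present"
```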