Add DSv4 FP8 H200 vLLM MTP benchmark #1222

Open
functionstackx wants to merge 5 commits into main from claude/add-dsv4-fp8-h200-vllm-mtp

Conversation

@functionstackx
Contributor

@functionstackx functionstackx commented Apr 29, 2026

Summary: ported the H200 recipe to MTP

  • New dsv4-fp8-h200-vllm-mtp config + benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh script.
  • MTP counterpart of dsv4-fp8-h200-vllm: identical launch flags (EP + DP=$TP, --gpu-memory-utilization 0.95, --max-num-seqs 512, --no-enable-flashinfer-autotune, FULL_DECODE_ONLY compile, max-model-len 800k) with one addition — --speculative-config '{"method":"mtp","num_speculative_tokens":2}'.
  • The recipe said H200 only supports spec token 1, but IIRC the Inferact folks said the H200 kernels now support spec decode with token 2, and based on @wzhao18's vLLM Blackwell MTP submission, token=2 appears to be the Pareto point (this obviously depends on the hardware SKU, so it isn't directly transferable).
  • Image: vllm/vllm-openai:v0.20.0-cu130 (canonical v0.20.0). The non-MTP dsv4-fp8-h200-vllm entry is unchanged and still on the deepseekv4-cu129 tag.
  • run_benchmark_serving invocation passes --dsv4 so prompts get chat-formatted encoding, per the AGENTS.md MTP rule (raw random tokens silently regress EAGLE acceptance).
  • Search space mirrors the non-MTP H200 entry: TP=8, EP=8, DP-attn=true, CONC 4-64, both 1k1k and 8k1k, with spec-decoding: mtp on each entry.
  • Adds a perf-changelog.yaml entry to trigger the new config.

Test plan

  • Trigger the dsv4-fp8-h200-vllm-mtp benchmark workflow on an H200 runner and confirm the engine starts and the sweep completes for at least one cell from each of the two seq-len-configs.
  • Confirm vllm/vllm-openai:v0.20.0-cu130 pulls cleanly.
  • server.log shows --speculative-config '{"method":"mtp","num_speculative_tokens":2}' and the rest of the H200 recipe flags.
  • Acceptance rate is in a sane range — --dsv4 is wired into run_benchmark_serving so the prompts go through the chat template.

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@@ -0,0 +1,99 @@
#!/usr/bin/env bash
🔴 The MTP benchmark script is added at benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh, but all three H200 launch scripts (runners/launch_h200-cw.sh:47, runners/launch_h200-nb.sh:22, runners/launch_h200-dgxc-slurm.sh:295) build the script path as benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh where FRAMEWORK_SUFFIX is empty for vllm — so they will look for dsv4_fp8_h200_mtp.sh and fail with 'No such file or directory' on every cell of the sweep. Unlike launch_b300-nv.sh, the H200 launchers have no framework-tagged-name fallback. Fix by either renaming the script to dsv4_fp8_h200_mtp.sh (matches the existing convention — see qwen3.5_fp8_h200_mtp.sh) or porting the B300 fallback logic to the H200 launchers.

Extended reasoning...

What the bug is

The PR adds a new vLLM MTP benchmark script at benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh and a corresponding dsv4-fp8-h200-vllm-mtp config in .github/configs/nvidia-master.yaml. However, the filename does not match what the H200 launch scripts will look for at runtime, so the workflow will hard-fail before vLLM ever starts.

How the launcher resolves the script path

All three H200 launch scripts build the benchmark script path the same way:

# runners/launch_h200-cw.sh:7-8, 47
MODEL_CODE="${EXP_NAME%%_*}"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
...
bash benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh

runners/launch_h200-nb.sh:7-8,22 is identical, and runners/launch_h200-dgxc-slurm.sh:295 inlines the same construction.

FRAMEWORK_SUFFIX is _trt only when the framework is trt; for vllm (and sglang) it is empty. SPEC_SUFFIX is _mtp when SPEC_DECODING=mtp.

Step-by-step proof for the new config

For the new dsv4-fp8-h200-vllm-mtp entry:

variable         value
model-prefix     dsv4
MODEL_CODE       dsv4 (from EXP_NAME="${model_code}_${seq_len_str}")
PRECISION        fp8
FRAMEWORK        vllm → FRAMEWORK_SUFFIX=""
SPEC_DECODING    mtp → SPEC_SUFFIX="_mtp"

So the resolved path is:

benchmarks/single_node/dsv4_fp8_h200_mtp.sh

But the PR added the file at:

benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh

bash will exit with No such file or directory, the runner will mark the cell as failed, and every cell of the new sweep (TP=8/EP=8, conc 4–64, both 1k1k and 8k1k) will fail before the engine starts.
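The derivation above can be replayed as a small runnable sketch. The variable names and suffix logic are taken from the launcher snippet quoted earlier; the seq-len half of EXP_NAME is an arbitrary example value.

```shell
#!/usr/bin/env bash
# Replay of the H200 launchers' path construction for the new config.
EXP_NAME="dsv4_1k1k"
PRECISION="fp8"
FRAMEWORK="vllm"
SPEC_DECODING="mtp"

MODEL_CODE="${EXP_NAME%%_*}"                                            # "dsv4"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')

resolved="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"
echo "$resolved"   # the un-tagged name, not the _vllm_mtp file this PR added
```

Running it prints `benchmarks/single_node/dsv4_fp8_h200_mtp.sh`, confirming the mismatch with the committed filename.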

Why the existing code does not save it

Unlike runners/launch_b300-nv.sh:267-272, which prefers a framework-tagged name and falls back to the legacy un-tagged name (this is exactly why dsv4_fp4_b300_vllm_mtp.sh and dsv4_fp4_b300_sglang_mtp.sh work on B300), the H200 launchers have no fallback — they construct one path and run it.

The existing H200 file naming convention confirms the expected name: every other vLLM/SGLang H200 MTP/non-MTP script in the tree omits the framework name (qwen3.5_fp8_h200_mtp.sh, dsr1_fp8_h200.sh, glm5_fp8_h200.sh, dsv4_fp8_h200.sh from this same series), and the only framework-tagged H200 scripts use _trt (dsr1_fp8_h200_trt_mtp.sh). The non-MTP counterpart in this PR's series — dsv4_fp8_h200.sh — already follows the no-suffix convention and works, which is itself evidence of the bug.

Impact and fix

This is a hard, deterministic PR-blocker: every cell of the new benchmark sweep fails to launch. Two fixes:

  1. Simplest: rename benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh → benchmarks/single_node/dsv4_fp8_h200_mtp.sh to match the existing H200 convention.
  2. Or: port the B300-style framework-tagged-then-legacy-fallback logic to all three H200 launch scripts so framework-tagged filenames also work.
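A rough sketch of option 2 (not the actual launch_b300-nv.sh code — the framework-tagged filename shape is an assumption): prefer a framework-tagged name, fall back to the legacy un-tagged one.

```shell
#!/usr/bin/env bash
# Hypothetical B300-style fallback, ported to the H200 path construction.
MODEL_CODE="dsv4"; PRECISION="fp8"; FRAMEWORK="vllm"; SPEC_SUFFIX="_mtp"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')

tagged="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200_${FRAMEWORK}${SPEC_SUFFIX}.sh"
legacy="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"

# Prefer the framework-tagged script if it exists; otherwise use the legacy name.
if [[ -f "$tagged" ]]; then
  script="$tagged"
else
  script="$legacy"
fi
echo "would run: $script"
```

With this in place, either filename works, so the PR's `_vllm_mtp` name would resolve without renaming.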

functionstackx and others added 4 commits April 29, 2026 00:47
Mirror of dsv4-fp8-h200-vllm + --speculative-config
'{"method":"mtp","num_speculative_tokens":2}', so we get an MTP
counterpart of the existing H200 vLLM DeepSeek-V4-Pro recipe at
https://vllm.ai/blog/deepseek-v4.

- Image: vllm/vllm-openai:v0.20.0-cu130 (canonical v0.20.0; the
  non-MTP entry is still on the deepseekv4-cu129 tag).
- Launch flags otherwise identical to dsv4_fp8_h200.sh: EP + DP=$TP,
  --gpu-memory-utilization 0.95, --max-num-seqs 512,
  --no-enable-flashinfer-autotune, FULL_DECODE_ONLY compile.
- run_benchmark_serving uses --dsv4 per the AGENTS.md MTP rule —
  EAGLE-style spec decoding regresses acceptance on raw random tokens.
- Search space mirrors the non-MTP H200 entry (TP=8, EP=8, DP-attn,
  CONC 4-64, both 1k1k and 8k1k) with spec-decoding: mtp.

Adds a perf-changelog entry to trigger the new config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The H200 runner (runners/launch_h200-cw.sh) constructs the script name
as ${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh
where FRAMEWORK_SUFFIX is empty for vllm — so it expects
benchmarks/single_node/dsv4_fp8_h200_mtp.sh, not the framework-named
dsv4_fp8_h200_vllm_mtp.sh.

Run 12597 failed with "No such file or directory"; rename to fix it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…um_speculative_tokens=1

- Export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 before vllm serve.
  The estimator overshoots H200 + MTP at memory-profile time and pushes
  us over budget even though actual cudagraph capture works fine.
- Drop num_speculative_tokens from 2 to 1 for now; bring it back up
  once we have a stable baseline on this image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the claude/add-dsv4-fp8-h200-vllm-mtp branch from 3f3052c to 4e6f92e Compare April 29, 2026 04:47
…dcoded 800k

Take the max-model-len from the runner-supplied MAX_MODEL_LEN env var
(added to check_env_vars) so the value is set centrally per config
instead of pinned in the script. Eval-only path is unchanged.
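A minimal sketch of the env-driven max-model-len this commit describes. The check_env_vars name comes from the commit message; this inline version of it is an assumption about what the helper does.

```shell
#!/usr/bin/env bash
# Fail fast if a required runner-supplied env var is missing (hypothetical
# stand-in for the repo's check_env_vars helper).
check_env_vars() {
  local v
  for v in "$@"; do
    [[ -n "${!v:-}" ]] || { echo "missing required env var: $v" >&2; return 1; }
  done
}

export MAX_MODEL_LEN="${MAX_MODEL_LEN:-800000}"   # the runner would export this per config
check_env_vars MAX_MODEL_LEN
echo "--max-model-len ${MAX_MODEL_LEN}"
```

The script then passes `--max-model-len "${MAX_MODEL_LEN}"` to the serve command instead of a hardcoded value, so each config can set it centrally.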

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>