Add DSv4 FP8 H200 vLLM MTP benchmark #1222
Open
functionstackx wants to merge 5 commits into main from claude/add-dsv4-fp8-h200-vllm-mtp
+136
−0
Commits
d2d42f8 Add DSv4 FP8 H200 vLLM MTP benchmark (functionstackx)
112d005 perf-changelog: fill in PR link for dsv4-fp8-h200-vllm-mtp (functionstackx)
5461147 dsv4-fp8-h200-vllm-mtp: rename script to match H200 runner convention (functionstackx)
4e6f92e dsv4-fp8-h200-vllm-mtp: VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0, n… (functionstackx)
e71f6f1 dsv4-fp8-h200-vllm-mtp: use $MAX_MODEL_LEN from runner instead of har… (functionstackx)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,105 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| # DeepSeek-V4-Pro H200 vLLM MTP variant of the recipe at | ||
| # https://vllm.ai/blog/deepseek-v4. Mirrors dsv4_fp8_h200.sh but adds | ||
| # --speculative-config '{"method":"mtp","num_speculative_tokens":1}' and | ||
| # routes prompts through chat-formatted encoding via --dsv4 (required for | ||
| # meaningful MTP acceptance numbers per AGENTS.md). | ||
|
|
||
| source "$(dirname "$0")/../benchmark_lib.sh" | ||
|
|
||
| check_env_vars \ | ||
| MODEL \ | ||
| TP \ | ||
| CONC \ | ||
| ISL \ | ||
| OSL \ | ||
| MAX_MODEL_LEN \ | ||
| RANDOM_RANGE_RATIO \ | ||
| RESULT_FILENAME | ||
|
|
||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
|
|
||
| nvidia-smi | ||
|
|
||
| hf download "$MODEL" | ||
|
|
||
| SERVER_LOG=/workspace/server.log | ||
| PORT=${PORT:-8888} | ||
|
|
||
| # DeepSeek-V4-Pro weights are large; engine startup can exceed the default | ||
| # 600s. Give it an hour to load. | ||
| export VLLM_ENGINE_READY_TIMEOUT_S=3600 | ||
|
|
||
| # Skip the cudagraph-memory estimator during the worker memory profiling | ||
| # phase — it overestimates and pushes us over the GPU memory budget on | ||
| # H200 + MTP, even though the actual cudagraph capture works fine. | ||
| export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 | ||
|
|
||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||
| setup_eval_context | ||
| MAX_MODEL_LEN_ARG="--max-model-len $EVAL_MAX_MODEL_LEN" | ||
| else | ||
| MAX_MODEL_LEN_ARG="--max-model-len $MAX_MODEL_LEN" | ||
| fi | ||
|
|
||
| # Start GPU monitoring (power, temperature, clocks every second) | ||
| start_gpu_monitor | ||
|
|
||
| # Per the recipe, run with EP + DP=8 (no --tensor-parallel-size flag). TP | ||
| # from the search space is used only for GPU allocation by the runner and | ||
| # as the DP size. | ||
| set -x | ||
| vllm serve $MODEL --host 0.0.0.0 --port $PORT \ | ||
| --trust-remote-code \ | ||
| --kv-cache-dtype fp8 \ | ||
| --block-size 256 \ | ||
| --no-enable-prefix-caching \ | ||
| --enable-expert-parallel \ | ||
| --data-parallel-size $TP \ | ||
| $MAX_MODEL_LEN_ARG \ | ||
| --gpu-memory-utilization 0.95 \ | ||
| --max-num-seqs 512 \ | ||
| --max-num-batched-tokens 512 \ | ||
| --no-enable-flashinfer-autotune \ | ||
| --compilation-config '{"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY"}' \ | ||
| --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \ | ||
| --tokenizer-mode deepseek_v4 \ | ||
| --tool-call-parser deepseek_v4 \ | ||
| --enable-auto-tool-choice \ | ||
| --reasoning-parser deepseek_v4 > $SERVER_LOG 2>&1 & | ||
|
|
||
| SERVER_PID=$! | ||
|
|
||
| # Wait for server to be ready | ||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
|
|
||
| pip install -q datasets pandas | ||
|
|
||
| # MTP acceptance rate degrades on raw random tokens; --dsv4 routes prompts | ||
| # through chat-formatted encoding as required for speculative decoding benchmarks. | ||
| run_benchmark_serving \ | ||
| --model "$MODEL" \ | ||
| --port "$PORT" \ | ||
| --backend vllm \ | ||
| --input-len "$ISL" \ | ||
| --output-len "$OSL" \ | ||
| --random-range-ratio "$RANDOM_RANGE_RATIO" \ | ||
| --num-prompts "$((CONC * 10))" \ | ||
| --max-concurrency "$CONC" \ | ||
| --result-filename "$RESULT_FILENAME" \ | ||
| --result-dir /workspace/ \ | ||
| --trust-remote-code \ | ||
| --dsv4 | ||
|
|
||
| # After throughput, run evaluation only if RUN_EVAL is true | ||
| if [ "${RUN_EVAL}" = "true" ]; then | ||
| run_eval --framework lm-eval --port "$PORT" | ||
| append_lm_eval_summary | ||
| fi | ||
|
|
||
| # Stop GPU monitoring | ||
| stop_gpu_monitor | ||
| set +x | ||
🔴 The MTP benchmark script is added at benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh, but all three H200 launch scripts (runners/launch_h200-cw.sh:47, runners/launch_h200-nb.sh:22, runners/launch_h200-dgxc-slurm.sh:295) build the script path as benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh, where FRAMEWORK_SUFFIX is empty for vllm. They will therefore look for dsv4_fp8_h200_mtp.sh and fail with 'No such file or directory' on every cell of the sweep. Unlike launch_b300-nv.sh, the H200 launchers have no framework-tagged-name fallback. Fix by either renaming the script to dsv4_fp8_h200_mtp.sh (which matches the existing convention; see qwen3.5_fp8_h200_mtp.sh) or porting the B300 fallback logic to the H200 launchers.
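The path construction described above can be reproduced in isolation. The following is a minimal sketch, not the launchers' actual code: the function names resolve_h200_script and resolve_h200_script_with_fallback are hypothetical, and the fallback variant only illustrates what "prefer a framework-tagged name, else use the un-tagged name" might look like if ported to H200. The real logic in launch_b300-nv.sh may differ.

```bash
#!/usr/bin/env bash
# Sketch of the H200 script-path construction (hypothetical helper names).

# Un-tagged path: FRAMEWORK_SUFFIX is "_trt" only for trt, and
# SPEC_SUFFIX is "_mtp" only when spec decoding is mtp.
resolve_h200_script() {
  local model_code=$1 precision=$2 framework=$3 spec_decoding=$4
  local framework_suffix="" spec_suffix=""
  if [[ "$framework" == "trt" ]]; then framework_suffix="_trt"; fi
  if [[ "$spec_decoding" == "mtp" ]]; then spec_suffix="_mtp"; fi
  echo "benchmarks/single_node/${model_code}_${precision}_h200${framework_suffix}${spec_suffix}.sh"
}

# ASSUMPTION: illustrative fallback, not copied from any launcher.
# Prefer a framework-tagged file if it exists, else the un-tagged name.
resolve_h200_script_with_fallback() {
  local model_code=$1 precision=$2 framework=$3 spec_decoding=$4
  local spec_suffix=""
  if [[ "$spec_decoding" == "mtp" ]]; then spec_suffix="_mtp"; fi
  local tagged="benchmarks/single_node/${model_code}_${precision}_h200_${framework}${spec_suffix}.sh"
  if [[ -f "$tagged" ]]; then
    echo "$tagged"
  else
    resolve_h200_script "$model_code" "$precision" "$framework" "$spec_decoding"
  fi
}

# The new config resolves to the un-tagged name, which the PR did not add.
resolve_h200_script dsv4 fp8 vllm mtp
# prints benchmarks/single_node/dsv4_fp8_h200_mtp.sh
```

Run from the repository root, the fallback variant would pick up dsv4_fp8_h200_vllm_mtp.sh if it exists, whereas the first function never considers that name.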
What the bug is
The PR adds a new vLLM MTP benchmark script at benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh and a corresponding dsv4-fp8-h200-vllm-mtp config in .github/configs/nvidia-master.yaml. However, the filename does not match what the H200 launch scripts will look for at runtime, so the workflow will hard-fail before vLLM ever starts.
How the launcher resolves the script path
All three H200 launch scripts build the benchmark script path the same way: benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh.
runners/launch_h200-cw.sh:47 uses this construction, runners/launch_h200-nb.sh:7-8,22 is identical, and runners/launch_h200-dgxc-slurm.sh:295 inlines the same construction. FRAMEWORK_SUFFIX is _trt only when the framework is trt; for vllm (and sglang) it is empty. SPEC_SUFFIX is _mtp when SPEC_DECODING=mtp.
Step-by-step proof for the new config
For the new dsv4-fp8-h200-vllm-mtp entry:
| Variable | Value |
|---|---|
| model-prefix | dsv4 |
| MODEL_CODE | dsv4 (from EXP_NAME="${model_code}_${seq_len_str}") |
| PRECISION | fp8 |
| FRAMEWORK | vllm → FRAMEWORK_SUFFIX="" |
| SPEC_DECODING | mtp → SPEC_SUFFIX="_mtp" |
So the resolved path is: benchmarks/single_node/dsv4_fp8_h200_mtp.sh
But the PR added the file at: benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh
bash will exit with No such file or directory, the runner will mark the cell as failed, and every cell of the new sweep (TP=8/EP=8, conc 4–64, both 1k1k and 8k1k) will fail before the engine starts.
Why the existing code does not save it
Unlike runners/launch_b300-nv.sh:267-272, which prefers a framework-tagged name and falls back to the legacy un-tagged name (this is exactly why dsv4_fp4_b300_vllm_mtp.sh and dsv4_fp4_b300_sglang_mtp.sh work on B300), the H200 launchers have no fallback: they construct one path and run it.
The existing H200 file naming convention confirms the expected name: every other vLLM/SGLang H200 MTP/non-MTP script in the tree omits the framework name (qwen3.5_fp8_h200_mtp.sh, dsr1_fp8_h200.sh, glm5_fp8_h200.sh, dsv4_fp8_h200.sh from this same series), and the only framework-tagged H200 scripts use _trt (dsr1_fp8_h200_trt_mtp.sh). The non-MTP counterpart in this PR's series, dsv4_fp8_h200.sh, already follows the no-suffix convention and works, which is itself evidence of the bug.
Impact and fix
This is a hard, deterministic PR-blocker: every cell of the new benchmark sweep fails to launch. Two fixes:
Rename benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh → benchmarks/single_node/dsv4_fp8_h200_mtp.sh to match the existing H200 convention.