21 changes: 21 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -2510,6 +2510,27 @@ dsv4-fp8-h200-vllm:
      search-space:
        - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64 }

# MTP variant of dsv4-fp8-h200-vllm. Uses the canonical v0.20.0-cu130 image
# (the non-MTP entry above is still on the deepseekv4-cu129 tag) and adds
# --speculative-config '{"method":"mtp","num_speculative_tokens":1}'.
dsv4-fp8-h200-vllm-mtp:
  image: vllm/vllm-openai:v0.20.0-cu130
  model: deepseek-ai/DeepSeek-V4-Pro
  model-prefix: dsv4
  runner: h200
  precision: fp8
  framework: vllm
  multinode: false
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space:
        - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64, spec-decoding: mtp }
    - isl: 8192
      osl: 1024
      search-space:
        - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64, spec-decoding: mtp }

# DeepSeek-V4-Pro B300 single-node aggregate recipe from the submitted B300
# pareto sweep. The single-node schema has no explicit data-parallel-size
# field, so dp-attn=true is used as the existing vLLM script switch for DP4
105 changes: 105 additions & 0 deletions benchmarks/single_node/dsv4_fp8_h200_mtp.sh
@@ -0,0 +1,105 @@
#!/usr/bin/env bash
Contributor
🔴 The MTP benchmark script is added at benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh, but all three H200 launch scripts (runners/launch_h200-cw.sh:47, runners/launch_h200-nb.sh:22, runners/launch_h200-dgxc-slurm.sh:295) build the script path as benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh where FRAMEWORK_SUFFIX is empty for vllm — so they will look for dsv4_fp8_h200_mtp.sh and fail with 'No such file or directory' on every cell of the sweep. Unlike launch_b300-nv.sh, the H200 launchers have no framework-tagged-name fallback. Fix by either renaming the script to dsv4_fp8_h200_mtp.sh (matches the existing convention — see qwen3.5_fp8_h200_mtp.sh) or porting the B300 fallback logic to the H200 launchers.

Extended reasoning...

What the bug is

The PR adds a new vLLM MTP benchmark script at benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh and a corresponding dsv4-fp8-h200-vllm-mtp config in .github/configs/nvidia-master.yaml. However, the filename does not match what the H200 launch scripts will look for at runtime, so the workflow will hard-fail before vLLM ever starts.

How the launcher resolves the script path

All three H200 launch scripts build the benchmark script path the same way:

# runners/launch_h200-cw.sh:7-8, 47
MODEL_CODE="${EXP_NAME%%_*}"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
...
bash benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh

runners/launch_h200-nb.sh:7-8,22 is identical, and runners/launch_h200-dgxc-slurm.sh:295 inlines the same construction.

FRAMEWORK_SUFFIX is _trt only when the framework is trt; for vllm (and sglang) it is empty. SPEC_SUFFIX is _mtp when SPEC_DECODING=mtp.

Step-by-step proof for the new config

For the new dsv4-fp8-h200-vllm-mtp entry:

variable        value
model-prefix    dsv4
MODEL_CODE      dsv4 (from EXP_NAME="${model_code}_${seq_len_str}")
PRECISION       fp8
FRAMEWORK       vllm → FRAMEWORK_SUFFIX=""
SPEC_DECODING   mtp → SPEC_SUFFIX="_mtp"

So the resolved path is:

benchmarks/single_node/dsv4_fp8_h200_mtp.sh

But the PR added the file at:

benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh

bash will exit with No such file or directory, the runner will mark the cell as failed, and every cell of the new sweep (TP=8/EP=8, conc 4–64, both 1k1k and 8k1k) will fail before the engine starts.
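
The resolution can be reproduced in a shell using the exact suffix logic quoted above; the variable values below are the ones this config produces (the snippet is a standalone illustration, not launcher code):

MODEL_CODE=dsv4 PRECISION=fp8 FRAMEWORK=vllm SPEC_DECODING=mtp
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
echo "benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"
# prints: benchmarks/single_node/dsv4_fp8_h200_mtp.sh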

Why the existing code does not save it

Unlike runners/launch_b300-nv.sh:267-272, which prefers a framework-tagged name and falls back to the legacy un-tagged name (this is exactly why dsv4_fp4_b300_vllm_mtp.sh and dsv4_fp4_b300_sglang_mtp.sh work on B300), the H200 launchers have no fallback — they construct one path and run it.

The existing H200 file naming convention confirms the expected name: every other vLLM/SGLang H200 MTP/non-MTP script in the tree omits the framework name (qwen3.5_fp8_h200_mtp.sh, dsr1_fp8_h200.sh, glm5_fp8_h200.sh, dsv4_fp8_h200.sh from this same series), and the only framework-tagged H200 scripts use _trt (dsr1_fp8_h200_trt_mtp.sh). The non-MTP counterpart in this PR's series — dsv4_fp8_h200.sh — already follows the no-suffix convention and works, which is itself evidence of the bug.

Impact and fix

This is a hard, deterministic PR-blocker: every cell of the new benchmark sweep fails to launch. Two fixes:

  1. Simplest: rename benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh → benchmarks/single_node/dsv4_fp8_h200_mtp.sh to match the existing H200 convention.
  2. Or: port the B300-style framework-tagged-then-legacy-fallback logic to all three H200 launch scripts so framework-tagged filenames also work (see the sketch after this list).
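
A minimal sketch of option 2, assuming the B300 fallback behaves as described above (prefer a framework-tagged name, fall back to the legacy un-tagged one). Variable names reuse the H200 launcher's; the exact implementation at runners/launch_b300-nv.sh:267-272 may differ:

# Hypothetical port of the B300 fallback to the H200 launchers.
SCRIPT_BASE="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200"
TAGGED="${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh"
LEGACY="${SCRIPT_BASE}${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"
if [[ -f "$TAGGED" ]]; then
  bash "$TAGGED"
else
  bash "$LEGACY"
fi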


# DeepSeek-V4-Pro H200 vLLM MTP variant of the recipe at
# https://vllm.ai/blog/deepseek-v4. Mirrors dsv4_fp8_h200.sh but adds
# --speculative-config '{"method":"mtp","num_speculative_tokens":1}' and
# routes prompts through chat-formatted encoding via --dsv4 (required for
# meaningful MTP acceptance numbers per AGENTS.md).

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
  MODEL \
  TP \
  CONC \
  ISL \
  OSL \
  MAX_MODEL_LEN \
  RANDOM_RANGE_RATIO \
  RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
  echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# DeepSeek-V4-Pro weights are large; engine startup can exceed the default
# 600s. Give it an hour to load.
export VLLM_ENGINE_READY_TIMEOUT_S=3600

# Skip the cudagraph-memory estimator during the worker memory profiling
# phase — it overestimates and pushes us over the GPU memory budget on
# H200 + MTP, even though the actual cudagraph capture works fine.
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0

if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
MAX_MODEL_LEN_ARG="--max-model-len $EVAL_MAX_MODEL_LEN"
else
MAX_MODEL_LEN_ARG="--max-model-len $MAX_MODEL_LEN"
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

# Per the recipe, run with EP + DP=8 (no --tensor-parallel-size flag). TP
# from the search space is used only for GPU allocation by the runner and
# as the DP size.
set -x
vllm serve $MODEL --host 0.0.0.0 --port $PORT \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --no-enable-prefix-caching \
  --enable-expert-parallel \
  --data-parallel-size $TP \
  $MAX_MODEL_LEN_ARG \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 512 \
  --no-enable-flashinfer-autotune \
  --compilation-config '{"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY"}' \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

# MTP acceptance rate degrades on raw random tokens; --dsv4 routes prompts
# through chat-formatted encoding as required for speculative decoding benchmarks.
run_benchmark_serving \
  --model "$MODEL" \
  --port "$PORT" \
  --backend vllm \
  --input-len "$ISL" \
  --output-len "$OSL" \
  --random-range-ratio "$RANDOM_RANGE_RATIO" \
  --num-prompts "$((CONC * 10))" \
  --max-concurrency "$CONC" \
  --result-filename "$RESULT_FILENAME" \
  --result-dir /workspace/ \
  --trust-remote-code \
  --dsv4

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT"
append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
10 changes: 10 additions & 0 deletions perf-changelog.yaml
@@ -1985,3 +1985,13 @@
- "Topology: 1 prefill DEP8 worker and 4 decode TP8 workers with dedicated NATS/etcd"
- "Mirrors the historical 1P4D DEP8/TP8 offload point from srt-slurm aflowers/vllm-gb200-v0.20.0"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1218

- config-keys:
    - dsv4-fp8-h200-vllm-mtp
  description:
    - "Add DeepSeek-V4-Pro FP8 H200 vLLM MTP variant (mirrors dsv4-fp8-h200-vllm with --speculative-config {\"method\":\"mtp\",\"num_speculative_tokens\":1})"
    - "Image: vllm/vllm-openai:v0.20.0-cu130"
    - "Set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 to skip the cudagraph-memory estimator (it overshoots the H200 + MTP memory budget at profile time even though actual cudagraph capture works fine)"
    - "run_benchmark_serving uses --dsv4 (chat-formatted prompts) per the AGENTS.md MTP rule, since EAGLE-style speculative decoding regresses acceptance on raw random tokens"
    - "Search space mirrors the non-MTP H200 entry: TP=8, EP=8, DP-attn=true, CONC 4-64 for both 1k1k and 8k1k, with spec-decoding: mtp"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1222