24 changes: 24 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -1728,6 +1728,30 @@ dsr1-fp4-b300-sglang:
        - { tp: 4, ep: 4, conc-start: 4, conc-end: 128 }
        - { tp: 8, ep: 8, conc-start: 4, conc-end: 16 }

# DeepSeek-V4-Pro recipe from https://vllm.ai/blog/deepseek-v4
# The Pro recipe lists 8xB200 and 8xB300 with identical flags. Runs with
# DP=8 + expert parallelism (no --tensor-parallel-size flag), FP8 KV cache,
# block size 256, and an FP4 indexer cache.
# Precision is tagged fp4fp8 because the model runs FP4 weights with an
# FP8 KV cache.
dsv4-fp4fp8-b300-vllm:
  image: vllm/vllm-openai:deepseekv4-cu130
  model: deepseek-ai/DeepSeek-V4-Pro
  model-prefix: dsv4
  runner: b300
  precision: fp4fp8
  framework: vllm
  multinode: false
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space:
        - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64 }
    - isl: 8192
      osl: 1024
      search-space:
        - { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64 }
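
As a rough illustration of how a search-space entry might drive the benchmark script, the sketch below assumes the runner doubles concurrency from conc-start to conc-end and exports TP, EP_SIZE, ISL, OSL, and CONC into the script's environment; the exact expansion logic lives in the runner, so treat this as an assumption rather than the actual contract.

# Hypothetical expansion of the 1k/1k entry { tp: 8, ep: 8, conc-start: 4, conc-end: 64 }.
TP=8 EP_SIZE=8 ISL=1024 OSL=1024
CONC=4
while [ "$CONC" -le 64 ]; do
    # Each iteration corresponds to one benchmark run at a fixed concurrency.
    echo "would launch: TP=$TP EP_SIZE=$EP_SIZE ISL=$ISL OSL=$OSL CONC=$CONC"
    CONC=$((CONC * 2))
done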

dsr1-fp4-b200-trt:
image: nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc6.post2
model: nvidia/DeepSeek-R1-0528-FP4-V2
89 changes: 89 additions & 0 deletions benchmarks/single_node/dsv4_fp4fp8_b300.sh
@@ -0,0 +1,89 @@
#!/usr/bin/env bash

# Per https://vllm.ai/blog/deepseek-v4 the DeepSeek-V4-Pro recipe lists
# 8xB200 and 8xB300 with identical flags, so this script mirrors
# dsv4_fp4_b200.sh.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    MAX_MODEL_LEN \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# DeepSeek-V4-Pro weights are large and engine startup on B300 can exceed
# the default 600s. Give it an hour to load.
export VLLM_ENGINE_READY_TIMEOUT_S=3600

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

# Per the recipe, run with EP + DP=8 (no --tensor-parallel-size flag). TP
# from the search space is used only for GPU allocation by the runner and
# as the DP size.
set -x
vllm serve $MODEL --host 0.0.0.0 --port $PORT \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --no-enable-prefix-caching \
    --enable-expert-parallel \
Contributor
🟡 The new dsv4_fp4fp8_b300.sh hardcodes --enable-expert-parallel at line 47, violating the project's PR-review rule (.github/workflows/claude-pr-review.yml lines 141-159) that scripts must conditionally enable EP based on EP_SIZE. Every other vLLM/ATOM script in benchmarks/single_node/ uses the if [ "$EP_SIZE" -gt 1 ] pattern; please wrap the flag the same way (and add EP_SIZE to check_env_vars) so a future search-space entry with ep: 1 doesn't silently still apply expert parallelism.

Extended reasoning...

What the bug is

In benchmarks/single_node/dsv4_fp4fp8_b300.sh line 47, the vllm serve invocation hardcodes --enable-expert-parallel unconditionally. The repo's own PR-review rules (.github/workflows/claude-pr-review.yml lines 141-159) explicitly forbid this and prescribe the canonical pattern:

if [ "$EP_SIZE" -gt 1 ]; then
  EP=" --enable-expert-parallel"
else
  EP=" "
fi

The rule is documented as a 🟡 WARNING-level review issue. Today the script also doesn't list EP_SIZE in its check_env_vars call (lines 8-16), so even if a caller exported EP_SIZE=1, the script would ignore it.

Why existing code doesn't prevent it

The sole reason runtime behavior is currently fine is that the new YAML search space (.github/configs/nvidia-master.yaml) only emits ep: 8 entries, so the runner always intends EP. There is nothing structural preventing a future ep: 1 entry — and once one is added, this script will silently still pass --enable-expert-parallel, contradicting the YAML's declared intent.

Code-path proof

  1. A future contributor adds { tp: 8, ep: 1, ... } to dsv4-fp4fp8-b300-vllm.search-space in nvidia-master.yaml to compare TP-only vs EP performance.
  2. The runner expands the entry and exports EP_SIZE=1 into the script's environment (this is the standard contract used by every other ATOM/vLLM script in benchmarks/single_node/).
  3. The script ignores EP_SIZE: check_env_vars doesn't list it, and the vllm serve command unconditionally has --enable-expert-parallel baked in.
  4. vLLM launches with expert parallelism on, producing perf numbers that don't match the search-space's declared ep: 1 configuration. The run is silently mislabeled in the result store.

Convention evidence

Grep over benchmarks/single_node/ shows ~24 sibling vLLM/ATOM scripts that use the conditional pattern (minimaxm2.5_fp8_b300.sh:34, minimaxm2.5_fp8_b200.sh:30, dsr1_fp4_mi355x_atom.sh, dsr1_fp4_mi355x_atom_mtp.sh, dsr1_fp8_mi355x_atom.sh, dsr1_fp8_mi355x_atom_mtp.sh, glm5_fp8_mi355x_atom.sh, glm5.1_fp4_mi355x_atom.sh, gptoss_fp4_mi355x_atom.sh, kimik2.5_fp4_mi355x_atom.sh, qwen3.5_fp8_mi355x_atom.sh, qwen3.5_fp8_mi355x_atom_mtp.sh, etc.). dsv4_fp4fp8_b300.sh is the lone outlier.

Fix

  1. Add EP_SIZE to the check_env_vars call (lines 8-16).
  2. Above the vllm serve block, insert:
    if [ "$EP_SIZE" -gt 1 ]; then
      EP=" --enable-expert-parallel"
    else
      EP=" "
    fi
  3. Replace the hardcoded --enable-expert-parallel \ line with $EP \ (matching the pattern in the sibling scripts).
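
Putting steps 1-3 together, a minimal sketch of how the patched section could read (only the flags relevant to the fix are shown; every other serve flag stays exactly as in the diff above):

# Conditionally enable expert parallelism, matching the sibling scripts.
if [ "$EP_SIZE" -gt 1 ]; then
  EP=" --enable-expert-parallel"
else
  EP=" "
fi

vllm serve $MODEL --host 0.0.0.0 --port $PORT \
    --kv-cache-dtype fp8 \
    $EP \
    --data-parallel-size $TP > $SERVER_LOG 2>&1 &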

Severity rationale

The project's own review rule classifies this as WARNING (not blocking), and runtime behavior is unaffected today because the YAML always emits ep: 8. It's a convention/robustness issue rather than a current functional bug — filing as nit.

    --data-parallel-size $TP \
    --max-model-len $MAX_MODEL_LEN \
    --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
    --attention_config.use_fp4_indexer_cache=True \
    --tokenizer-mode deepseek_v4 \
    --tool-call-parser deepseek_v4 \
    --enable-auto-tool-choice \
    --reasoning-parser deepseek_v4 > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/ \
--trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
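
For local testing, a hypothetical one-off invocation of the script could look like the following; the MAX_MODEL_LEN, RANDOM_RANGE_RATIO, and result-filename values are illustrative placeholders, not values taken from the runner config:

MODEL=deepseek-ai/DeepSeek-V4-Pro TP=8 CONC=64 ISL=1024 OSL=1024 \
MAX_MODEL_LEN=8192 RANDOM_RANGE_RATIO=0.8 \
RESULT_FILENAME=dsv4_fp4fp8_b300_1k1k_c64.json \
bash benchmarks/single_node/dsv4_fp4fp8_b300.sh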
18 changes: 18 additions & 0 deletions perf-changelog.yaml
@@ -1775,3 +1775,21 @@
- "Model: sgl-project/DeepSeek-V4-Pro-FP8"
- "https://github.com/sgl-project/sglang/pull/23608#issuecomment-4311952977"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1134

- config-keys:
    - dsv4-fp4fp8-b300-vllm
  description:
    - "Add DeepSeek-V4-Pro vLLM B300 benchmark per https://vllm.ai/blog/deepseek-v4"
    - "Image: vllm/vllm-openai:deepseekv4-cu130"
    - "Model: deepseek-ai/DeepSeek-V4-Pro"
    - "Recipe lists 8xB200 and 8xB300 with identical flags; this mirrors the B200 config"
    - "EP + DP=8 (no --tensor-parallel-size), FP4 weights + FP8 KV cache, block size 256, FP4 indexer cache"
    - "Precision tag fp4fp8 reflects fp4 weights with fp8 KV cache"
    - "VLLM_ENGINE_READY_TIMEOUT_S=3600 to accommodate large weight loading"
    - "Flags: --trust-remote-code, --kv-cache-dtype fp8, --block-size 256, --no-enable-prefix-caching,"
    - "  --enable-expert-parallel, --data-parallel-size=$TP,"
    - "  --compilation-config cudagraph_mode=FULL_AND_PIECEWISE custom_ops=all,"
    - "  --attention_config.use_fp4_indexer_cache=True, --tokenizer-mode deepseek_v4,"
    - "  --tool-call-parser deepseek_v4, --enable-auto-tool-choice, --reasoning-parser deepseek_v4"
    - "Configs: 1k1k conc 4-64, 8k1k conc 4-64"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1128