Merged
104 changes: 104 additions & 0 deletions benchmarks/single_node/dsv4_fp4_b300_vllm.sh
@@ -0,0 +1,104 @@
#!/usr/bin/env bash

# Per https://vllm.ai/blog/deepseek-v4 the DeepSeek-V4-Pro recipe lists
# 8xB200 and 8xB300 with identical flags, so this script mirrors
# dsv4_fp4_b200.sh.

Claude / Claude Code Review

Misleading comment: claims to mirror dsv4_fp4_b200.sh which is actually an SGLang script

Comment on lines +3 to +5

🟡 The header comment claims this script "mirrors dsv4_fp4_b200.sh" since the vLLM blog lists 8xB200 and 8xB300 with identical flags, but dsv4_fp4_b200.sh is an SGLang script (uses sglang serve with SGLang-specific flags like --moe-runner-backend, --moe-a2a-backend deepep, CONC-based recipe dispatch). This new script uses vllm serve with an entirely different flag set and a vLLM-specific monkey-patch — they share no engine or flags. Consider dropping the mirrors dsv4_fp4_b200.sh reference (or pointing it at an actual vLLM companion) so future readers don't go looking for parity that doesn't exist.

Extended reasoning...

What's wrong

Lines 3-5 of benchmarks/single_node/dsv4_fp4_b300_vllm.sh read:

# Per https://vllm.ai/blog/deepseek-v4 the DeepSeek-V4-Pro recipe lists
# 8xB200 and 8xB300 with identical flags, so this script mirrors
# dsv4_fp4_b200.sh.

The "so this script mirrors dsv4_fp4_b200.sh" conclusion only follows if dsv4_fp4_b200.sh is itself a vLLM script — but it isn't. benchmarks/single_node/dsv4_fp4_b200.sh invokes PYTHONNOUSERSITE=1 sglang serve and uses SGLang-only flags (--moe-runner-backend flashinfer_mxfp4, --moe-a2a-backend deepep, --enable-dp-attention, --deepep-config, etc.) plus a CONC-based 3-recipe dispatch (low-latency / balanced / max-throughput).

The new script doesn't mirror it

dsv4_fp4_b300_vllm.sh invokes vllm serve with a totally disjoint flag set (--kv-cache-dtype fp8, --block-size 256, --enable-expert-parallel, --data-parallel-size, --compilation-config '{...}', --tokenizer-mode deepseek_v4, --reasoning-parser deepseek_v4, etc.), monkey-patches vLLM's sparse_attn_indexer.py, has no RECIPE_FLAGS array, and no CONC-based dispatch. The two scripts share nothing beyond trivial benchmark_lib.sh boilerplate.

Why the existing wording is misleading

The vLLM blog parity claim ("B200 and B300 with identical flags") is a vLLM-to-vLLM statement and would justify mirroring a hypothetical dsv4_fp4_b200_vllm.sh — but no such file exists in the repo. The PR description even notes that this script was "restored from the abandoned origin/claude/add-dsv4-fp4-b300-vllm branch," which is a plausible explanation for the stale reference: it may originally have pointed at a never-merged vLLM b200 sibling.

Step-by-step proof

  1. Open benchmarks/single_node/dsv4_fp4_b200.sh and grep for the server invocation → line ~86 reads PYTHONNOUSERSITE=1 sglang serve \. So b200 is SGLang.
  2. Open benchmarks/single_node/dsv4_fp4_b300_vllm.sh and grep for the server invocation → it reads vllm serve $MODEL .... So b300_vllm is vLLM.
  3. Compare flag lists: b200 uses --moe-runner-backend, --moe-a2a-backend deepep, --mem-fraction-static, --disable-radix-cache, etc.; b300_vllm uses --kv-cache-dtype fp8, --block-size, --enable-expert-parallel, --data-parallel-size, --compilation-config, etc. → zero overlap.
  4. ls benchmarks/single_node/dsv4_fp4_b200_vllm.sh → no such file. So no actual vLLM b200 sibling to mirror.
  5. Conclusion: the comment is factually wrong; reading b200.sh expecting flag/structure parity will mislead.
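The proof steps above can be sketched as a runnable check. This is a hedged sketch only: it uses hypothetical stand-in files in a temp dir rather than the real `benchmarks/single_node/` tree, with one representative line per script.

```shell
#!/usr/bin/env bash
# Sketch of the proof steps above, against stand-in files in a temp dir.
# The stand-ins are hypothetical; the real check targets benchmarks/single_node/.
set -euo pipefail
tmp=$(mktemp -d)

# Stand-in for dsv4_fp4_b200.sh: an SGLang invocation
printf 'PYTHONNOUSERSITE=1 sglang serve --moe-a2a-backend deepep\n' \
  > "$tmp/dsv4_fp4_b200.sh"
# Stand-in for dsv4_fp4_b300_vllm.sh: a vLLM invocation
printf 'vllm serve "$MODEL" --kv-cache-dtype fp8\n' \
  > "$tmp/dsv4_fp4_b300_vllm.sh"

grep -q 'sglang serve' "$tmp/dsv4_fp4_b200.sh"      && echo "b200 uses SGLang"
grep -q 'vllm serve'   "$tmp/dsv4_fp4_b300_vllm.sh" && echo "b300_vllm uses vLLM"
[ ! -e "$tmp/dsv4_fp4_b200_vllm.sh" ]               && echo "no vLLM b200 sibling"
```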

Impact / fix

This is documentation only — no runtime behavior is affected. But the comment is the first thing a reader sees, and the PR is specifically about disambiguating engines (_sglang vs _vllm suffixes). Leaving a comment that points the vLLM script at the SGLang script undercuts that disambiguation. Suggested fix: either drop the so this script mirrors dsv4_fp4_b200.sh clause and keep just the recipe-source link, or replace it with something accurate like "flags follow the vLLM DeepSeek-V4-Pro recipe; B200 and B300 use identical flags per the blog."
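One possible replacement header along the lines of the second suggested wording (a sketch, not an authoritative suggestion):

```shell
# Flags follow the vLLM DeepSeek-V4-Pro recipe at
# https://vllm.ai/blog/deepseek-v4; per that blog, 8xB200 and 8xB300 use
# identical flags. (There is no vLLM B200 sibling script in this repo.)
```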


source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
  MODEL \
  TP \
  CONC \
  ISL \
  OSL \
  MAX_MODEL_LEN \
  RANDOM_RANGE_RATIO \
  RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
  echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# DeepSeek-V4-Pro weights are large and engine startup on B300 can exceed
# the default 600s. Give it an hour to load.
export VLLM_ENGINE_READY_TIMEOUT_S=3600

if [ "${EVAL_ONLY}" = "true" ]; then
  setup_eval_context
  MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi

# Monkey-patch: bypass persistent_topk unconditionally. It raises "k out of
# range" during CUDA graph capture when the dummy batch has rows with
# seq_lens[i] < k (=2048 for DSV4). An attn_metadata.max_seq_len-based gate is
# not strict enough because dummy batches can have max >= k while individual
# rows have seq_lens[i] = 1. Fall back to top_k_per_row_decode everywhere so
# 1k/1k capture completes; 8k/1k already worked without the patch but we trade
# a small decode-time perf cost there to keep the script single-branch.
INDEXER_PY=/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sparse_attn_indexer.py
echo "[monkey-patch] patching $INDEXER_PY"
sed -i 's/if current_platform.is_cuda() and topk_tokens in (512, 1024, 2048)[^:]*:/if False: # monkey-patched: bypass persistent_topk (k out of range)/' "$INDEXER_PY"
if ! grep -Fq 'if False: # monkey-patched: bypass persistent_topk' "$INDEXER_PY"; then
  echo "[monkey-patch] FAILED: expected marker not found in $INDEXER_PY" >&2
  echo "[monkey-patch] current line around persistent_topk dispatch:" >&2
  grep -n 'topk_tokens in\|persistent_topk' "$INDEXER_PY" >&2 || true
  exit 1
fi
echo "[monkey-patch] applied: $(grep -n 'if False: # monkey-patched' "$INDEXER_PY")"

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

# Per the recipe, run with EP + DP=8 (no --tensor-parallel-size flag). TP
# from the search space is used only for GPU allocation by the runner and
# as the DP size.
set -x
vllm serve $MODEL --host 0.0.0.0 --port $PORT \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --no-enable-prefix-caching \
  --enable-expert-parallel \
  --data-parallel-size $TP \
  --max-model-len $MAX_MODEL_LEN \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
  --model "$MODEL" \
  --port "$PORT" \
  --backend vllm \
  --input-len "$ISL" \
  --output-len "$OSL" \
  --random-range-ratio "$RANDOM_RANGE_RATIO" \
  --num-prompts "$((CONC * 10))" \
  --max-concurrency "$CONC" \
  --result-filename "$RESULT_FILENAME" \
  --result-dir /workspace/ \
  --trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
  run_eval --framework lm-eval --port "$PORT"
  append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
13 changes: 11 additions & 2 deletions runners/launch_b300-nv.sh
@@ -259,8 +259,17 @@ else
export MODEL="$HF_HUB_CACHE_MOUNT/dsv4-pro"
fi
SQUASH_FILE="/data/home/sa-shared/gharunners/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
-FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
+# Prefer a framework-tagged script (e.g. dsv4_fp4_b300_sglang.sh) so models
+# with multiple inference engines can coexist; fall back to the historical
+# name without an engine suffix (`_trt` for trt, bare for everyone else)
+# for scripts that haven't been retagged yet.
+BENCH_BASE="benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_b300"
+BENCH_SCRIPT="${BENCH_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh"
+if [[ ! -f "$BENCH_SCRIPT" ]]; then
+  LEGACY_FW_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
+  BENCH_SCRIPT="${BENCH_BASE}${LEGACY_FW_SUFFIX}${SPEC_SUFFIX}.sh"
+fi
LOCK_FILE="${SQUASH_FILE}.lock"

# TODO(Cam): the deepseek-v4 sglang images (lmsysorg/sglang:deepseek-v4-blackwell
@@ -300,6 +309,6 @@ else
--no-container-mount-home \
--container-workdir=$CONTAINER_MOUNT_DIR \
--no-container-entrypoint --export=ALL,PORT=8888 \
-bash benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_b300${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh
+bash "$BENCH_SCRIPT"

fi
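The suffix-fallback dispatch above can be demonstrated with a self-contained sketch. The `pick_script` helper and the temp-dir layout are hypothetical stand-ins for the launcher's inline logic and the real `benchmarks/single_node/` tree:

```shell
#!/usr/bin/env bash
# Demo of the engine-suffix fallback: prefer the framework-tagged script,
# else fall back to the legacy name (`_trt` for trt, bare for other engines).
# pick_script and the file names below are hypothetical stand-ins.
set -euo pipefail

pick_script() {
  local base="$1" framework="$2" spec_suffix="$3"
  local script="${base}_${framework}${spec_suffix}.sh"
  if [[ ! -f "$script" ]]; then
    local legacy_fw=""
    if [[ "$framework" == "trt" ]]; then legacy_fw="_trt"; fi
    script="${base}${legacy_fw}${spec_suffix}.sh"
  fi
  echo "$script"
}

tmp=$(mktemp -d)
base="$tmp/dsv4_fp4_b300"
touch "${base}_vllm.sh" "${base}.sh"  # tagged vLLM script plus legacy bare script

pick_script "$base" vllm ""    # tagged script exists, so it is picked
pick_script "$base" sglang ""  # no _sglang script, falls back to the bare name
```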