Merged
22 changes: 22 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -1774,6 +1774,28 @@ dsr1-fp8-b200-sglang:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 4 }
- { tp: 4, ep: 1, conc-start: 4, conc-end: 32 }

# NOTE: At the time of submission, https://cookbook.sglang.io/autoregressive/DeepSeek/DeepSeek-R1
# does not have a B300-specific recipe, so this config reuses the existing DSR1 FP8
# B200 SGLang recipe as-is until B300-specific tuning is available.
dsr1-fp8-b300-sglang:
image: lmsysorg/sglang:v0.5.10.post1-cu130
model: deepseek-ai/DeepSeek-R1-0528
model-prefix: dsr1
runner: b300
precision: fp8
framework: sglang
multinode: false
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 64 }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 4 }
- { tp: 4, ep: 1, conc-start: 4, conc-end: 32 }

qwen3.5-bf16-b200-sglang:
image: lmsysorg/sglang:nightly-dev-20260216-d3bae71e
model: Qwen/Qwen3.5-397B-A17B
113 changes: 113 additions & 0 deletions benchmarks/single_node/dsr1_fp8_b300.sh
@@ -0,0 +1,113 @@
#!/usr/bin/env bash

# NOTE: At the time of submission, https://cookbook.sglang.io/autoregressive/DeepSeek/DeepSeek-R1
# does not have a B300-specific recipe, so this script reuses the existing
# DSR1 FP8 B200 SGLang recipe as-is until B300-specific tuning is available.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
  MODEL \
  TP \
  CONC \
  ISL \
  OSL \
  RANDOM_RANGE_RATIO \
  RESULT_FILENAME \
  EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
  echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

🟡 The new dsr1_fp8_b300.sh calls hf download "$MODEL" at line 25, but the B300 runner (launch_b300-nv.sh) overrides MODEL to a local filesystem path (e.g. /scratch/models/DeepSeek-R1-0528) before launching the container. This causes hf download to fail on every B300 CI run, producing error noise — though the script continues since there is no set -e. The fix is to remove the hf download call, as was done correctly in the only other B300 single-node script (qwen3.5_fp8_b300_mtp.sh).

Extended reasoning...

What the bug is: dsr1_fp8_b300.sh was copied verbatim from the B200 script and includes a call to hf download "$MODEL" at line 25. On B200, MODEL remains a valid HuggingFace repo ID (e.g. deepseek-ai/DeepSeek-R1-0528), so the call succeeds. On B300, however, the runner overrides MODEL to a local path before this code runs.

The specific code path that triggers it: In launch_b300-nv.sh, the single-node (non-multinode) branch at line 220 executes: export MODEL="/scratch/models/${MODEL#*/}". This strips the HuggingFace org prefix and prepends the local scratch directory. So deepseek-ai/DeepSeek-R1-0528 becomes /scratch/models/DeepSeek-R1-0528 inside the container. When dsr1_fp8_b300.sh subsequently runs hf download "/scratch/models/DeepSeek-R1-0528", it passes an absolute filesystem path as the repo ID, which is not a valid HuggingFace repository identifier.
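The parameter expansion at the heart of this code path can be reproduced in isolation. The snippet below is a standalone sketch, not code from launch_b300-nv.sh; the values are copied from the review above:

```shell
# Standalone illustration of the MODEL rewrite on the B300 runner.
MODEL="deepseek-ai/DeepSeek-R1-0528"
# ${MODEL#*/} removes the shortest leading match of "*/", i.e. the HF org prefix.
MODEL="/scratch/models/${MODEL#*/}"
echo "$MODEL"   # prints: /scratch/models/DeepSeek-R1-0528
```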

Why existing code doesn't prevent it: The script has no set -e, so when hf download fails, execution continues. The SGLang server launch at the bottom uses --model-path=$MODEL, which correctly references the local path — so the benchmark itself runs fine. The download failure is silently swallowed, appearing only as error noise in CI logs.
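The "no set -e" behavior is easy to demonstrate generically; this is a plain-bash illustration unrelated to the repo's scripts:

```shell
# In a bash process without `set -e`, a failing command does not abort the
# script; execution simply continues, which is how the failed download is
# swallowed here.
out=$(bash -c 'false; echo "still running, last status: $?"')
echo "$out"   # prints: still running, last status: 1
```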

Impact: Every B300 CI run for dsr1-fp8-b300-sglang will produce an error from the failed hf download call. While not functionally blocking (the model loads from the pre-cached local path), it pollutes CI logs, can mask real errors, and violates the established B300 pattern.

How to fix it: Remove line 25 (hf download "$MODEL") from dsr1_fp8_b300.sh. This matches exactly what qwen3.5_fp8_b300_mtp.sh does — the only other B300 single-node benchmark script — which intentionally omits the download step because models are pre-cached at /scratch/models/ on B300 runners.

Step-by-step proof:

  1. CI triggers dsr1-fp8-b300-sglang benchmark.
  2. launch_b300-nv.sh sets HF_HUB_CACHE_MOUNT="/scratch/models" and runs export MODEL="/scratch/models/${MODEL#*/}"; MODEL becomes /scratch/models/DeepSeek-R1-0528.
  3. The container launches and dsr1_fp8_b300.sh runs.
  4. Line 25: hf download "/scratch/models/DeepSeek-R1-0528" — the HuggingFace CLI receives an absolute path instead of an org/repo identifier, rejects it as an invalid repo ID, and exits with an error.
  5. Because there is no set -e, execution continues past the error.
  6. The SGLang server starts with --model-path=/scratch/models/DeepSeek-R1-0528 and loads the model correctly.
  7. The benchmark completes, but CI logs contain the spurious hf download error on every run.


export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true

Check failure on line 28 in benchmarks/single_node/dsr1_fp8_b300.sh

Claude / Claude Code Review

dsr1_fp8_b300.sh missing B300 adaptations: --enable-symm-mem and B200-only env vars

Comment on lines +27 to +28

🔴 dsr1_fp8_b300.sh was copied verbatim from the B200 script and is missing two B300-specific adaptations that will cause suboptimal benchmark throughput on B300 hardware. First, lines 27-28 carry over SGL_ENABLE_JIT_DEEPGEMM=false and SGLANG_ENABLE_FLASHINFER_GEMM=true from B200 — no other B300 SGLang script sets these, and SGL_ENABLE_JIT_DEEPGEMM=false actively disables a JIT GEMM path that all other B300 scripts leave enabled by default. Second, the SGLang server launch (lines 76-80) is missing --enable-symm-mem, which is present in every other B300 SGLang script (dsr1_fp4_b300.sh line 52, qwen3.5_fp8_b300.sh line 37, qwen3.5_fp8_b300_mtp.sh line 37) and enables NVLink5 symmetric memory for tensor-parallel communication. Both omissions cause this B300 config to produce lower benchmark throughput than the hardware is capable of, undermining the purpose of adding a B300 config. Fix: remove lines 27-28 and add --enable-symm-mem to the server launch command, matching the pattern of all other B300 SGLang scripts.

Extended reasoning...

What the bugs are and how they manifest

dsr1_fp8_b300.sh was copied verbatim from dsr1_fp8_b200.sh without applying two B300-specific adaptations that every other B300 SGLang benchmark script includes. The result is a B300 config that will produce benchmark throughput lower than what B300 hardware is capable of.

Bug 1 — Missing --enable-symm-mem: The SGLang server launch at lines 76-80 does not include --enable-symm-mem. This flag enables NVLink5 symmetric memory for direct tensor-parallel communication on B300 hardware, bypassing standard NCCL allreduce. Without it, the benchmark falls back to NCCL allreduce — correct behavior, but not the optimal path on B300. Every other B300 SGLang script in the repository includes this flag: dsr1_fp4_b300.sh (line 52), qwen3.5_fp8_b300.sh (line 37), and qwen3.5_fp8_b300_mtp.sh (line 37). The B200 source script (dsr1_fp8_b200.sh) does not have it because B200 lacks NVLink5 support, so the copy omits it for the wrong reason.

Bug 2 — B200-specific env vars carried into B300: Lines 27-28 set SGL_ENABLE_JIT_DEEPGEMM=false and SGLANG_ENABLE_FLASHINFER_GEMM=true. These appear in all B200 SGLang scripts but in none of the other B300 scripts. SGL_ENABLE_JIT_DEEPGEMM=false is particularly impactful: it actively disables JIT DeepGEMM compilation, which on B300 (SM_100) suppresses a hardware-specific GEMM optimization path that all other B300 benchmark scripts rely on by default. The B200 script disables JIT DeepGEMM due to SM_90 stability concerns, which do not apply to SM_100. Precedent is clear: qwen3.5_fp8_b300.sh (PR #1048) deliberately stripped both vars when adapted from its B200 counterpart.

Why the "reuse B200 recipe as-is" framing does not excuse these omissions

One verifier argues these carry-overs are intentional because the PR description says it reuses the B200 recipe "as-is". However, this argument applies differently to the two issues. For Bug 1 (--enable-symm-mem): this is not a B200 setting being carried over; it is a missing B300 setting. Choosing to "reuse the B200 recipe" does not explain away the absence of a B300 hardware-specific flag — it just means the B200 recipe never had it. For Bug 2 (env vars): even granting the intent to use the B200 recipe verbatim, SGL_ENABLE_JIT_DEEPGEMM=false is not a neutral setting — it is an active suppression of an optimization that is otherwise the default on B300. The qwen3.5 B300 adaptation demonstrates the established pattern: when porting B200→B300, these two env vars are deliberately stripped. The PR note says "recipe as-is" but the recipe comparison shows the correct B300 pattern explicitly excludes these vars.

Step-by-step proof

  1. All three existing B300 SGLang single-node scripts include --enable-symm-mem in their server launch commands. dsr1_fp8_b300.sh does not.
  2. All three existing B300 SGLang single-node scripts omit SGL_ENABLE_JIT_DEEPGEMM and SGLANG_ENABLE_FLASHINFER_GEMM. dsr1_fp8_b300.sh (this PR) sets both.
  3. qwen3.5_fp8_b200.sh (B200) sets both env vars; qwen3.5_fp8_b300.sh (B300, PR #1048 "Add B300 config: qwen3.5-fp8-sglang (non-MTP)") deliberately removed them, demonstrating the intended B300 adaptation pattern.
  4. A B300 benchmark run with this script will: (a) use NCCL allreduce instead of NVLink5 symmetric memory (suboptimal TP communication), and (b) disable JIT DeepGEMM (suppresses a GEMM optimization path active in all other B300 runs).
  5. Both issues produce artificially low throughput numbers, making the B300 benchmark results not representative of B300 hardware capability.

Fix

Remove lines 27-28 (SGL_ENABLE_JIT_DEEPGEMM=false and SGLANG_ENABLE_FLASHINFER_GEMM=true) and add --enable-symm-mem to the sglang.launch_server invocation, matching the pattern of dsr1_fp4_b300.sh, qwen3.5_fp8_b300.sh, and qwen3.5_fp8_b300_mtp.sh.
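Under this suggested fix, the launch fragment would look roughly as follows. This is a sketch only, not the merged code: the two B200 env exports are dropped entirely, --enable-symm-mem is taken from the other B300 scripts cited above, and the unrelated flags are elided:

```shell
# Sketch of the reviewer's proposed adaptation (not the merged script):
# no SGL_ENABLE_JIT_DEEPGEMM / SGLANG_ENABLE_FLASHINFER_GEMM exports, and
# --enable-symm-mem added to the server launch.
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
  --tensor-parallel-size=$TP --data-parallel-size=1 \
  --enable-symm-mem \
  ... # remaining flags unchanged from the script below
```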

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Default: recv every ~10 requests; if CONC ≥ 16, relax to ~30 requests between scheduler recv polls.
if [[ $TP -eq 8 ]]; then
  if [[ $CONC -ge 16 ]]; then
    SCHEDULER_RECV_INTERVAL=30
  else
    SCHEDULER_RECV_INTERVAL=10
  fi

  # Setting these values (passed to --cuda-graph-max-bs and --max-running-requests)
  # to the maximum concurrency helps avoid memory being unnecessarily used.
  MAX_RUNNING_REQUESTS=128
  CUDA_GRAPH_MAX_BATCH_SIZE=128

  MEM_FRAC_STATIC=0.82
  CHUNKED_PREFILL_SIZE=32768
  MAX_PREFILL_TOKENS=32768
elif [[ $TP -eq 4 ]]; then
  if [[ $ISL -ne 8192 ]] || [[ $OSL -ne 1024 ]]; then
    echo "TP=4 not yet supported for ISL=$ISL OSL=$OSL!"
    exit 1
  fi

  # Setting these values (passed to --cuda-graph-max-bs and --max-running-requests)
  # to the maximum concurrency helps avoid memory being unnecessarily used.
  MAX_RUNNING_REQUESTS=32
  CUDA_GRAPH_MAX_BATCH_SIZE=32

  MEM_FRAC_STATIC=0.95
  CHUNKED_PREFILL_SIZE=8192
  MAX_PREFILL_TOKENS=8192

  SCHEDULER_RECV_INTERVAL=10
else
  echo "Unrecognized TP size $TP!"
  exit 1
fi
echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
  setup_eval_context
  EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
--tensor-parallel-size=$TP --data-parallel-size=1 \
--cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
--mem-fraction-static $MEM_FRAC_STATIC --kv-cache-dtype fp8_e4m3 --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
--enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL --disable-radix-cache \
--attention-backend trtllm_mla --stream-interval 30 --ep-size $EP_SIZE --moe-runner-backend flashinfer_trtllm --quantization fp8 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
  run_eval --framework lm-eval --port "$PORT"
  append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
8 changes: 8 additions & 0 deletions perf-changelog.yaml
@@ -1404,3 +1404,11 @@
- "Image: lmsysorg/sglang:v0.5.10.post1-cu130"
- "At the time of submission, https://cookbook.sglang.io/autoregressive/DeepSeek/DeepSeek-R1 does not have a B300-specific recipe, so this reuses the existing DSR1 FP4 B200 SGLang recipe as-is"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1049

- config-keys:
- dsr1-fp8-b300-sglang
description:
- "Add DeepSeek-R1-0528 FP8 B300 SGLang benchmark (non-MTP)"
- "Image: lmsysorg/sglang:v0.5.10.post1-cu130"
- "At the time of submission, https://cookbook.sglang.io/autoregressive/DeepSeek/DeepSeek-R1 does not have a B300-specific recipe, so this reuses the existing DSR1 FP8 B200 SGLang recipe as-is"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1050
claude[bot] marked this conversation as resolved.