Merged
23 changes: 23 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -1669,6 +1669,29 @@ dsr1-fp4-b200-sglang:
- { tp: 4, ep: 4, conc-start: 4, conc-end: 128 }
- { tp: 8, ep: 8, conc-start: 4, conc-end: 16 }

# NOTE: At the time of submission, https://cookbook.sglang.io/autoregressive/DeepSeek/DeepSeek-R1
# does not have a B300-specific recipe, so this config reuses the existing DSR1 FP4
# B200 SGLang recipe as-is until B300-specific tuning is available.
dsr1-fp4-b300-sglang:
image: lmsysorg/sglang:v0.5.10.post1-cu130
model: nvidia/DeepSeek-R1-0528-FP4-V2
model-prefix: dsr1
runner: b300
precision: fp4
framework: sglang
multinode: false
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 4, ep: 4, conc-start: 4, conc-end: 128 }
- { tp: 8, ep: 8, conc-start: 4, conc-end: 128 }
- isl: 8192
osl: 1024
search-space:
- { tp: 4, ep: 4, conc-start: 4, conc-end: 128 }
- { tp: 8, ep: 8, conc-start: 4, conc-end: 16 }

dsr1-fp4-b200-trt:
image: nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc6.post2
model: nvidia/DeepSeek-R1-0528-FP4-V2
81 changes: 81 additions & 0 deletions benchmarks/single_node/dsr1_fp4_b300.sh
@@ -0,0 +1,81 @@
#!/usr/bin/env bash

# NOTE: At the time of submission, https://cookbook.sglang.io/autoregressive/DeepSeek/DeepSeek-R1
# does not have a B300-specific recipe, so this script reuses the existing
# DSR1 FP4 B200 SGLang recipe as-is until B300-specific tuning is available.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME \
EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

hf download "$MODEL"

nvidia-smi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Default: recv every ~10 requests; if CONC ≥ 16, relax to ~30 requests between scheduler recv polls.
if [[ $CONC -ge 16 ]]; then
SCHEDULER_RECV_INTERVAL=30
else
SCHEDULER_RECV_INTERVAL=10
fi
echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path $MODEL --host 0.0.0.0 --port $PORT --trust-remote-code \
--tensor-parallel-size=$TP --data-parallel-size=1 \
--cuda-graph-max-bs 256 --max-running-requests 256 --mem-fraction-static 0.85 --kv-cache-dtype fp8_e4m3 \
--chunked-prefill-size 16384 \
--ep-size $EP_SIZE --quantization modelopt_fp4 --enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
--enable-symm-mem --disable-radix-cache --attention-backend trtllm_mla --moe-runner-backend flashinfer_trtllm --stream-interval 10 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts $((CONC * 10)) \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT"
append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
8 changes: 8 additions & 0 deletions perf-changelog.yaml
@@ -1396,3 +1396,11 @@
- "Image: lmsysorg/sglang:v0.5.10.post1-cu130"
- "TP=4, concurrency 4-256 for 1k1k and 8k1k"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1048

- config-keys:
- dsr1-fp4-b300-sglang
description:
- "Add DeepSeek-R1-0528 FP4 B300 SGLang benchmark (non-MTP)"
- "Image: lmsysorg/sglang:v0.5.10.post1-cu130"
- "At the time of submission, https://cookbook.sglang.io/autoregressive/DeepSeek/DeepSeek-R1 does not have a B300-specific recipe, so this reuses the existing DSR1 FP4 B200 SGLang recipe as-is"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1049
7 changes: 6 additions & 1 deletion runners/launch_b300-nv.sh
@@ -216,10 +216,15 @@

else

HF_HUB_CACHE_MOUNT="/scratch/models"
export MODEL="/scratch/models/${MODEL#*/}"
# Qwen3.5-397B-A17B-FP8 is pre-staged under /scratch/models on the B300 cluster,
# so point MODEL at the local copy. Other models fall through and use `hf download`
# against the mounted cache from their benchmark script.
if [[ "$MODEL" == "Qwen/Qwen3.5-397B-A17B-FP8" ]]; then
export MODEL="/scratch/models/${MODEL#*/}"
fi
SQUASH_FILE="/data/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')

Check failure on line 227 in runners/launch_b300-nv.sh

Claude / Claude Code Review

DSR1 FP4 B300: MODEL not rewritten to local path, benchmark will fail
Comment on lines 219 to 227

🔴 The MODEL path rewrite in launch_b300-nv.sh was narrowed to only apply to Qwen/Qwen3.5-397B-A17B-FP8; for the new dsr1-fp4-b300-sglang config (model: nvidia/DeepSeek-R1-0528-FP4-V2), MODEL is never rewritten and remains the HuggingFace repo ID. Inside the B300 single-node container, HF_HUB_CACHE=/mnt/hf_hub_cache/ (set by CI) is exported via --export=ALL but /mnt/hf_hub_cache/ is never mounted — only /scratch/models:/scratch/models is mounted — so hf download nvidia/DeepSeek-R1-0528-FP4-V2 cannot find the pre-staged model and may attempt a ~600 GB internet download, and --model-path nvidia/DeepSeek-R1-0528-FP4-V2 will cause SGLang to fail to start. Fix: add nvidia/DeepSeek-R1-0528-FP4-V2 to the if condition in the single-node else branch so MODEL is rewritten to /scratch/models/DeepSeek-R1-0528-FP4-V2, matching the Qwen pattern and consistent with what the multinode branch already does for this model.

Extended reasoning...

What the bug is and how it manifests

PR #1035 introduced the B300 single-node runner and originally rewrote MODEL unconditionally: export MODEL="/scratch/models/${MODEL#*/}". This PR changed that to a conditional that only rewrites when MODEL == "Qwen/Qwen3.5-397B-A17B-FP8", with a comment saying other models "fall through and use hf download". For the new dsr1-fp4-b300-sglang config (model: nvidia/DeepSeek-R1-0528-FP4-V2), MODEL therefore remains the raw HuggingFace repo ID when the benchmark script is invoked.

The specific code path that triggers it

benchmark-tmpl.yml line 74 sets HF_HUB_CACHE=/mnt/hf_hub_cache/. The B300 single-node runner sets HF_HUB_CACHE_MOUNT="/scratch/models" and mounts it as --container-mounts=...,$HF_HUB_CACHE_MOUNT:$HF_HUB_CACHE_MOUNT — i.e. /scratch/models:/scratch/models. It passes --export=ALL, which exports HF_HUB_CACHE=/mnt/hf_hub_cache/ into the container, but /mnt/hf_hub_cache/ is never mounted there. When dsr1_fp4_b300.sh runs inside the container, MODEL=nvidia/DeepSeek-R1-0528-FP4-V2, so line 23 (hf download "$MODEL") invokes huggingface-cli against HF_HUB_CACHE=/mnt/hf_hub_cache/ (non-existent in the container). The model cannot be found and huggingface-cli will attempt a ~600 GB internet download or fail. Similarly, --model-path nvidia/DeepSeek-R1-0528-FP4-V2 on the SGLang launch command causes SGLang to attempt the same broken HF resolution path, preventing server startup entirely.

Why existing code does not prevent it

The PR comment says other models can use hf download against the mounted cache. However, the cache mount is at /scratch/models:/scratch/models, while HF_HUB_CACHE (exported from the host) points to /mnt/hf_hub_cache/. These paths are misaligned, so the HF cache lookup fails inside the container. B200 handles this correctly by mounting -v /raid/hf_hub_cache/:/mnt/hf_hub_cache/, making the mount point match the env var. B300 does not — it mounts at /scratch/models but exports HF_HUB_CACHE=/mnt/hf_hub_cache/.
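The mismatch described above can be made concrete with a small check. The paths come from the review text; the variable names here are illustrative, not taken from the runner script:

```shell
# Illustrative check of the B300 mount/env mismatch: the container mounts
# /scratch/models, but HF_HUB_CACHE points at /mnt/hf_hub_cache/, which
# nothing backs inside the container.
hf_hub_cache="/mnt/hf_hub_cache/"
mount_target="/scratch/models"
if [ "${hf_hub_cache%/}" != "${mount_target%/}" ]; then
  echo "mismatch: HF_HUB_CACHE=${hf_hub_cache} has no backing mount (only ${mount_target} is mounted)"
fi
# On B200 the two line up: -v /raid/hf_hub_cache/:/mnt/hf_hub_cache/
```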

What the impact would be

The dsr1-fp4-b300-sglang benchmark will fail entirely: SGLang cannot locate the pre-staged model weights, so the server fails to start and all benchmark runs produce no results. At best, a ~600 GB download is attempted and times out; at worst the job fails immediately. This completely blocks the new config from producing any benchmark data.

Step-by-step proof

  1. CI runs a dsr1-fp4-b300-sglang job with model: nvidia/DeepSeek-R1-0528-FP4-V2 from nvidia-master.yaml.
  2. launch_b300-nv.sh enters the single-node else branch. The condition if [[ "$MODEL" == "Qwen/Qwen3.5-397B-A17B-FP8" ]] is false, so MODEL stays as nvidia/DeepSeek-R1-0528-FP4-V2.
  3. The runner mounts /scratch/models:/scratch/models and exports HF_HUB_CACHE=/mnt/hf_hub_cache/ via --export=ALL.
  4. Inside the container: /mnt/hf_hub_cache/ is not mounted; /scratch/models is mounted but HF does not know to look there.
  5. dsr1_fp4_b300.sh line 23: hf download "nvidia/DeepSeek-R1-0528-FP4-V2" — the CLI checks HF_HUB_CACHE=/mnt/hf_hub_cache/ (missing), then falls back, and initiates a download of the ~600 GB model.
  6. --model-path nvidia/DeepSeek-R1-0528-FP4-V2 on the SGLang server launch — SGLang makes the same HF lookup, fails to find local weights, and the server cannot start.
  7. Contrast with the multinode branch of the same script (lines 23-26): it explicitly sets MODEL_PATH="/scratch/models/deepseek-r1-0528-nvfp4-v2" for dsr1-fp4, confirming the model IS pre-staged at that path on the B300 cluster.
  8. Fix: add an elif clause to rewrite MODEL to /scratch/models/DeepSeek-R1-0528-FP4-V2 for nvidia/DeepSeek-R1-0528-FP4-V2, consistent with the strip-prefix pattern used for Qwen and with the multinode branch's explicit local path.
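The fix in step 8 can be sketched as a change to the single-node else branch. This is hypothetical: it follows the review's strip-prefix suggestion for the target path /scratch/models/DeepSeek-R1-0528-FP4-V2, which is not a verified listing of the cluster's on-disk layout.

```shell
# Sketch of the review's suggested fix: extend the rewrite condition so the
# DSR1 FP4 model is also pointed at a pre-staged /scratch/models copy.
MODEL="nvidia/DeepSeek-R1-0528-FP4-V2"   # as CI would set for dsr1-fp4-b300-sglang
if [[ "$MODEL" == "Qwen/Qwen3.5-397B-A17B-FP8" || \
      "$MODEL" == "nvidia/DeepSeek-R1-0528-FP4-V2" ]]; then
  export MODEL="/scratch/models/${MODEL#*/}"   # strip the org prefix
fi
echo "$MODEL"   # /scratch/models/DeepSeek-R1-0528-FP4-V2
```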

SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')

# Pin to one of the known-good B300 nodes; others have hardware/network