Add B300 config: dsr1-fp8-sglang (non-MTP) #1050
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.
nvidia-smi

hf download "$MODEL"
🟡 The new dsr1_fp8_b300.sh calls hf download "$MODEL" at line 25, but the B300 runner (launch_b300-nv.sh) overrides MODEL to a local filesystem path (e.g. /scratch/models/DeepSeek-R1-0528) before launching the container. This causes hf download to fail on every B300 CI run, producing error noise — though the script continues since there is no set -e. The fix is to remove the hf download call, as was done correctly in the only other B300 single-node script (qwen3.5_fp8_b300_mtp.sh).
Extended reasoning...
What the bug is: dsr1_fp8_b300.sh was copied verbatim from the B200 script and includes a call to hf download "$MODEL" at line 25. On B200, MODEL remains a valid HuggingFace repo ID (e.g. deepseek-ai/DeepSeek-R1-0528), so the call succeeds. On B300, however, the runner overrides MODEL to a local path before this code runs.
The specific code path that triggers it: In launch_b300-nv.sh, the single-node (non-multinode) branch at line 220 executes: export MODEL="/scratch/models/${MODEL#*/}". This strips the HuggingFace org prefix and prepends the local scratch directory. So deepseek-ai/DeepSeek-R1-0528 becomes /scratch/models/DeepSeek-R1-0528 inside the container. When dsr1_fp8_b300.sh subsequently runs hf download "/scratch/models/DeepSeek-R1-0528", it passes an absolute filesystem path as the repo ID, which is not a valid HuggingFace repository identifier.
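To make the rewrite concrete, the parameter expansion behaves like this (a standalone sketch, not an excerpt from `launch_b300-nv.sh` itself):

```bash
# Illustration of the MODEL rewrite performed by the B300 launcher.
MODEL="deepseek-ai/DeepSeek-R1-0528"

# ${MODEL#*/} strips the shortest prefix ending in "/", i.e. the
# HuggingFace org ("deepseek-ai/"), leaving only the repo name.
export MODEL="/scratch/models/${MODEL#*/}"

echo "$MODEL"   # prints: /scratch/models/DeepSeek-R1-0528
```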
Why existing code doesn't prevent it: The script has no set -e, so when hf download fails, execution continues. The SGLang server launch at the bottom uses --model-path=$MODEL, which correctly references the local path — so the benchmark itself runs fine. The download failure is silently swallowed, appearing only as error noise in CI logs.
Impact: Every B300 CI run for dsr1-fp8-b300-sglang will produce an error from the failed hf download call. While not functionally blocking (the model loads from the pre-cached local path), it pollutes CI logs, can mask real errors, and violates the established B300 pattern.
How to fix it: Remove line 25 (hf download "$MODEL") from dsr1_fp8_b300.sh. This matches exactly what qwen3.5_fp8_b300_mtp.sh does — the only other B300 single-node benchmark script — which intentionally omits the download step because models are pre-cached at /scratch/models/ on B300 runners.
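A minimal sketch of the suggested change, assuming the top of `dsr1_fp8_b300.sh` otherwise mirrors the B200 original (the surrounding lines are illustrative, not quoted from the PR):

```bash
# dsr1_fp8_b300.sh (illustrative excerpt)
nvidia-smi

# The B200 script runs `hf download "$MODEL"` at this point. On B300
# runners the model is pre-cached under /scratch/models and MODEL is a
# local path, so the download step is dropped entirely, matching
# qwen3.5_fp8_b300_mtp.sh.
```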
Step-by-step proof:
- CI triggers the `dsr1-fp8-b300-sglang` benchmark. `launch_b300-nv.sh` sets `HF_HUB_CACHE_MOUNT="/scratch/models"` and runs `export MODEL="/scratch/models/${MODEL#*/}"` → `MODEL` becomes `/scratch/models/DeepSeek-R1-0528`.
- The container launches and `dsr1_fp8_b300.sh` runs.
- Line 25: `hf download "/scratch/models/DeepSeek-R1-0528"` — the HuggingFace CLI receives an absolute path instead of an `org/repo` identifier, rejects it as an invalid repo ID, and exits with an error.
- Because there is no `set -e`, execution continues past the error.
- The SGLang server starts with `--model-path=/scratch/models/DeepSeek-R1-0528` and loads the model correctly.
- The benchmark completes, but CI logs contain the spurious `hf download` error on every run.
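If the download step were kept for portability between B200 and B300 runners, a guarded variant along the following lines would also avoid the error; this is a hypothetical alternative, not what the review recommends for this repo:

```bash
# Only call the HuggingFace CLI when MODEL is a repo ID such as
# "deepseek-ai/DeepSeek-R1-0528", not an absolute local path as on B300.
if [[ "$MODEL" != /* ]]; then
  hf download "$MODEL"
fi
```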
Force-pushed from 0a0446f to d29c4a4
At the time of submission, the SGLang DSR1 cookbook does not have a B300-specific recipe, so this config reuses the existing B200 DSR1 FP8 SGLang recipe as-is until B300-specific tuning is available. Image bumped to v0.5.10.post1-cu130 to match the standard B300 SGLang image used by other B300 configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from d29c4a4 to 43e2882
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
🔴 dsr1_fp8_b300.sh was copied verbatim from the B200 script and is missing two B300-specific adaptations that will cause suboptimal benchmark throughput on B300 hardware.

First, lines 27-28 carry over SGL_ENABLE_JIT_DEEPGEMM=false and SGLANG_ENABLE_FLASHINFER_GEMM=true from B200 — no other B300 SGLang script sets these, and SGL_ENABLE_JIT_DEEPGEMM=false actively disables a JIT GEMM path that all other B300 scripts leave enabled by default.

Second, the SGLang server launch (lines 76-80) is missing --enable-symm-mem, which is present in every other B300 SGLang script (dsr1_fp4_b300.sh line 52, qwen3.5_fp8_b300.sh line 37, qwen3.5_fp8_b300_mtp.sh line 37) and enables NVLink5 symmetric memory for tensor-parallel communication.

Both omissions cause this B300 config to produce lower benchmark throughput than the hardware is capable of, undermining the purpose of adding a B300 config. Fix: remove lines 27-28 and add --enable-symm-mem to the server launch command, matching the pattern of all other B300 SGLang scripts.
Extended reasoning...
What the bugs are and how they manifest
dsr1_fp8_b300.sh was copied verbatim from dsr1_fp8_b200.sh without applying two B300-specific adaptations that every other B300 SGLang benchmark script includes. The result is a B300 config that will produce benchmark throughput lower than what B300 hardware is capable of.
Bug 1 — Missing --enable-symm-mem: The SGLang server launch at lines 76-80 does not include --enable-symm-mem. This flag enables NVLink5 symmetric memory for direct tensor-parallel communication on B300 hardware, bypassing standard NCCL allreduce. Without it, the benchmark falls back to NCCL allreduce — correct behavior, but not the optimal path on B300. Every other B300 SGLang script in the repository includes this flag: dsr1_fp4_b300.sh (line 52), qwen3.5_fp8_b300.sh (line 37), and qwen3.5_fp8_b300_mtp.sh (line 37). The B200 source script (dsr1_fp8_b200.sh) does not have it because B200 lacks NVLink5 support, so the copy omits it for the wrong reason.
Bug 2 — B200-specific env vars carried into B300: Lines 27-28 set SGL_ENABLE_JIT_DEEPGEMM=false and SGLANG_ENABLE_FLASHINFER_GEMM=true. These appear in all B200 SGLang scripts but in none of the other B300 scripts. SGL_ENABLE_JIT_DEEPGEMM=false is particularly impactful: it actively disables JIT DeepGEMM compilation, which on B300 (SM_100) suppresses a hardware-specific GEMM optimization path that all other B300 benchmark scripts rely on by default. The B200 script disables JIT DeepGEMM due to SM_90 stability concerns, which do not apply to SM_100. Precedent is clear: qwen3.5_fp8_b300.sh (PR #1048) deliberately stripped both vars when adapted from its B200 counterpart.
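For reference, the contrast between the carried-over B200 environment and the B300 default looks roughly like this (the exports are those quoted above; the comments are interpretive):

```bash
# Carried over from dsr1_fp8_b200.sh into dsr1_fp8_b300.sh (lines 27-28):
export SGL_ENABLE_JIT_DEEPGEMM=false        # turns off JIT DeepGEMM compilation
export SGLANG_ENABLE_FLASHINFER_GEMM=true   # forces the FlashInfer GEMM path

# Other B300 scripts (dsr1_fp4_b300.sh, qwen3.5_fp8_b300.sh,
# qwen3.5_fp8_b300_mtp.sh) set neither variable, so JIT DeepGEMM stays
# enabled by default on SM_100.
```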
Why the "reuse B200 recipe as-is" framing does not excuse these omissions
One verifier argues these carry-overs are intentional because the PR description says it reuses the B200 recipe "as-is". However, this argument applies differently to the two issues. For Bug 1 (--enable-symm-mem): this is not a B200 setting being carried over; it is a missing B300 setting. Choosing to "reuse the B200 recipe" does not explain away the absence of a B300 hardware-specific flag — it just means the B200 recipe never had it. For Bug 2 (env vars): even granting the intent to use the B200 recipe verbatim, SGL_ENABLE_JIT_DEEPGEMM=false is not a neutral setting — it is an active suppression of an optimization that is otherwise the default on B300. The qwen3.5 B300 adaptation demonstrates the established pattern: when porting B200→B300, these two env vars are deliberately stripped. The PR note says "recipe as-is" but the recipe comparison shows the correct B300 pattern explicitly excludes these vars.
Step-by-step proof
- All three existing B300 SGLang single-node scripts include `--enable-symm-mem` in their server launch commands. `dsr1_fp8_b300.sh` does not.
- All three existing B300 SGLang single-node scripts omit `SGL_ENABLE_JIT_DEEPGEMM` and `SGLANG_ENABLE_FLASHINFER_GEMM`. `dsr1_fp8_b300.sh` (this PR) sets both.
- `qwen3.5_fp8_b200.sh` (B200) sets both env vars; `qwen3.5_fp8_b300.sh` (B300, PR "Add B300 config: qwen3.5-fp8-sglang (non-MTP)" #1048) deliberately removed them — demonstrating the intended B300 adaptation pattern.
- A B300 benchmark run with this script will: (a) use NCCL allreduce instead of NVLink5 symmetric memory (suboptimal TP communication), and (b) disable JIT DeepGEMM (suppressing a GEMM optimization path active in all other B300 runs).
- Both issues produce artificially low throughput numbers, making the B300 benchmark results not representative of B300 hardware capability.
Fix
Remove lines 27-28 (SGL_ENABLE_JIT_DEEPGEMM=false and SGLANG_ENABLE_FLASHINFER_GEMM=true) and add --enable-symm-mem to the sglang.launch_server invocation, matching the pattern of dsr1_fp4_b300.sh, qwen3.5_fp8_b300.sh, and qwen3.5_fp8_b300_mtp.sh.
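A hedged sketch of what the adjusted script could look like after both changes; only `--model-path`, `--enable-symm-mem`, and TP=8 come from the discussion above, while the port and any additional tuning flags are placeholders:

```bash
# B200-only exports removed (previously lines 27-28):
#   export SGL_ENABLE_JIT_DEEPGEMM=false
#   export SGLANG_ENABLE_FLASHINFER_GEMM=true

# Server launch with NVLink5 symmetric memory enabled, matching the other
# B300 SGLang scripts.
python3 -m sglang.launch_server \
  --model-path="$MODEL" \
  --tp 8 \
  --enable-symm-mem \
  --port 30000
```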
Summary
- Adds the `dsr1-fp8-b300-sglang` config (non-MTP DeepSeek-R1 FP8 on B300 via SGLang).
- `benchmarks/single_node/dsr1_fp8_b300.sh` reuses the existing B200 DSR1 FP8 SGLang recipe as-is — at the time of submission, the SGLang DSR1 cookbook does not yet have a B300-specific recipe. The note is mirrored in `dsr1_fp8_b300.sh`, `nvidia-master.yaml`, and `perf-changelog.yaml`.
- Image: `lmsysorg/sglang:v0.5.10.post1-cu130`, to match the standard B300 SGLang image used by the Qwen3.5 B300 and DSR1 FP4 B300 configs (Add B300 config: dsr1-fp4-sglang (non-MTP) #1049).
- No changes to `runners/launch_b300-nv.sh` or `.github/workflows/benchmark-tmpl.yml` — already wired up by Add B300 config: qwen3.5-fp8-sglang-mtp #1035.

Test plan
- CI triggers `dsr1-fp8-b300-sglang` and runs the 1k1k (TP=8) and 8k1k (TP=8, TP=4) search space.

🤖 Generated with Claude Code