
Add B300 config: dsr1-fp8-sglang (non-MTP)#1050

Merged
functionstackx merged 2 commits into main from claude/add-dsr1-fp8-b300-sglang
Apr 17, 2026

Conversation

@functionstackx
Contributor

Summary

  • Adds dsr1-fp8-b300-sglang config (non-MTP DeepSeek-R1 FP8 on B300 via SGLang).
  • New benchmark script benchmarks/single_node/dsr1_fp8_b300.sh reuses the existing B200 DSR1 FP8 SGLang recipe as-is — at the time of submission, the SGLang DSR1 cookbook does not yet have a B300-specific recipe. The note is mirrored in dsr1_fp8_b300.sh, nvidia-master.yaml, and perf-changelog.yaml.
  • Image bumped to lmsysorg/sglang:v0.5.10.post1-cu130 to match the standard B300 SGLang image used by the Qwen3.5 B300 and DSR1 FP4 B300 configs (PR #1049: Add B300 config: dsr1-fp4-sglang (non-MTP)).
  • No changes to runners/launch_b300-nv.sh or .github/workflows/benchmark-tmpl.yml — already wired up by PR #1035 (Add B300 config: qwen3.5-fp8-sglang-mtp).

Test plan

  • Sweep picks up dsr1-fp8-b300-sglang and runs the 1k1k (TP=8) and 8k1k (TP=8, TP=4) search space
  • Results publish to inferencex.com and look sane relative to B200 DSR1 FP8 SGLang

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please open a documentation PR first, before we can merge your PR into the master branch. Let's keep the documentation first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.


nvidia-smi

hf download "$MODEL"
Contributor


🟡 The new dsr1_fp8_b300.sh calls hf download "$MODEL" at line 25, but the B300 runner (launch_b300-nv.sh) overrides MODEL to a local filesystem path (e.g. /scratch/models/DeepSeek-R1-0528) before launching the container. This causes hf download to fail on every B300 CI run, producing error noise — though the script continues since there is no set -e. The fix is to remove the hf download call, as was done correctly in the only other B300 single-node script (qwen3.5_fp8_b300_mtp.sh).

Extended reasoning...

What the bug is: dsr1_fp8_b300.sh was copied verbatim from the B200 script and includes a call to hf download "$MODEL" at line 25. On B200, MODEL remains a valid HuggingFace repo ID (e.g. deepseek-ai/DeepSeek-R1-0528), so the call succeeds. On B300, however, the runner overrides MODEL to a local path before this code runs.

The specific code path that triggers it: In launch_b300-nv.sh, the single-node (non-multinode) branch at line 220 executes: export MODEL="/scratch/models/${MODEL#*/}". This strips the HuggingFace org prefix and prepends the local scratch directory. So deepseek-ai/DeepSeek-R1-0528 becomes /scratch/models/DeepSeek-R1-0528 inside the container. When dsr1_fp8_b300.sh subsequently runs hf download "/scratch/models/DeepSeek-R1-0528", it passes an absolute filesystem path as the repo ID, which is not a valid HuggingFace repository identifier.
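The parameter expansion described above can be checked in isolation. A minimal sketch, using a hard-coded example value for MODEL (the expansion itself mirrors the quoted line from launch_b300-nv.sh):

```shell
# Example stand-in for the value MODEL holds when launch_b300-nv.sh runs
MODEL="deepseek-ai/DeepSeek-R1-0528"

# "${MODEL#*/}" strips everything up to and including the first "/"
# (the HuggingFace org prefix), then the scratch directory is prepended.
export MODEL="/scratch/models/${MODEL#*/}"

echo "$MODEL"   # prints /scratch/models/DeepSeek-R1-0528
```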

Why existing code doesn't prevent it: The script has no set -e, so when hf download fails, execution continues. The SGLang server launch at the bottom uses --model-path=$MODEL, which correctly references the local path — so the benchmark itself runs fine. The download failure is silently swallowed, appearing only as error noise in CI logs.

Impact: Every B300 CI run for dsr1-fp8-b300-sglang will produce an error from the failed hf download call. While not functionally blocking (the model loads from the pre-cached local path), it pollutes CI logs, can mask real errors, and violates the established B300 pattern.

How to fix it: Remove line 25 (hf download "$MODEL") from dsr1_fp8_b300.sh. This matches exactly what qwen3.5_fp8_b300_mtp.sh does — the only other B300 single-node benchmark script — which intentionally omits the download step because models are pre-cached at /scratch/models/ on B300 runners.

Step-by-step proof:

  1. CI triggers dsr1-fp8-b300-sglang benchmark.
  2. launch_b300-nv.sh sets HF_HUB_CACHE_MOUNT="/scratch/models" and runs export MODEL="/scratch/models/${MODEL#*/}", so MODEL becomes /scratch/models/DeepSeek-R1-0528.
  3. The container launches and dsr1_fp8_b300.sh runs.
  4. Line 25: hf download "/scratch/models/DeepSeek-R1-0528" — the HuggingFace CLI receives an absolute path instead of an org/repo identifier, rejects it as an invalid repo ID, and exits with an error.
  5. Because there is no set -e, execution continues past the error.
  6. The SGLang server starts with --model-path=/scratch/models/DeepSeek-R1-0528 and loads the model correctly.
  7. The benchmark completes, but CI logs contain the spurious hf download error on every run.
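If a portable download step is preferred over deleting the call outright, one option is to guard it. A sketch — the guard is hypothetical and is not what qwen3.5_fp8_b300_mtp.sh does (that script simply omits the call):

```shell
# Example B300 value, as set by launch_b300-nv.sh before the script runs
MODEL="/scratch/models/DeepSeek-R1-0528"

# Hypothetical guard: only attempt the download when MODEL looks like a
# HuggingFace org/repo ID rather than an absolute local path.
case "$MODEL" in
  /*) echo "MODEL is a local path; skipping hf download" ;;
  *)  hf download "$MODEL" ;;
esac
```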

@functionstackx functionstackx marked this pull request as draft April 17, 2026 06:44
@functionstackx functionstackx force-pushed the claude/add-dsr1-fp8-b300-sglang branch from 0a0446f to d29c4a4 Compare April 17, 2026 08:04
functionstackx and others added 2 commits April 17, 2026 04:19
At the time of submission, the SGLang DSR1 cookbook does not have a
B300-specific recipe, so this config reuses the existing B200 DSR1 FP8
SGLang recipe as-is until B300-specific tuning is available. Image bumped
to v0.5.10.post1-cu130 to match the standard B300 SGLang image used by
other B300 configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the claude/add-dsr1-fp8-b300-sglang branch from d29c4a4 to 43e2882 Compare April 17, 2026 08:19
@functionstackx functionstackx marked this pull request as ready for review April 17, 2026 08:28
Comment on lines +27 to +28
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
Contributor


🔴 dsr1_fp8_b300.sh was copied verbatim from the B200 script and is missing two B300-specific adaptations that will cause suboptimal benchmark throughput on B300 hardware. First, lines 27-28 carry over SGL_ENABLE_JIT_DEEPGEMM=false and SGLANG_ENABLE_FLASHINFER_GEMM=true from B200 — no other B300 SGLang script sets these, and SGL_ENABLE_JIT_DEEPGEMM=false actively disables a JIT GEMM path that all other B300 scripts leave enabled by default. Second, the SGLang server launch (lines 76-80) is missing --enable-symm-mem, which is present in every other B300 SGLang script (dsr1_fp4_b300.sh line 52, qwen3.5_fp8_b300.sh line 37, qwen3.5_fp8_b300_mtp.sh line 37) and enables NVLink5 symmetric memory for tensor-parallel communication. Both omissions cause this B300 config to produce lower benchmark throughput than the hardware is capable of, undermining the purpose of adding a B300 config. Fix: remove lines 27-28 and add --enable-symm-mem to the server launch command, matching the pattern of all other B300 SGLang scripts.

Extended reasoning...

What the bugs are and how they manifest

dsr1_fp8_b300.sh was copied verbatim from dsr1_fp8_b200.sh without applying two B300-specific adaptations that every other B300 SGLang benchmark script includes. The result is a B300 config that will produce benchmark throughput lower than what B300 hardware is capable of.

Bug 1 — Missing --enable-symm-mem: The SGLang server launch at lines 76-80 does not include --enable-symm-mem. This flag enables NVLink5 symmetric memory for direct tensor-parallel communication on B300 hardware, bypassing standard NCCL allreduce. Without it, the benchmark falls back to NCCL allreduce — correct behavior, but not the optimal path on B300. Every other B300 SGLang script in the repository includes this flag: dsr1_fp4_b300.sh (line 52), qwen3.5_fp8_b300.sh (line 37), and qwen3.5_fp8_b300_mtp.sh (line 37). The B200 source script (dsr1_fp8_b200.sh) does not have it because B200 lacks NVLink5 support, so the copy omits it for the wrong reason.

Bug 2 — B200-specific env vars carried into B300: Lines 27-28 set SGL_ENABLE_JIT_DEEPGEMM=false and SGLANG_ENABLE_FLASHINFER_GEMM=true. These appear in all B200 SGLang scripts but in none of the other B300 scripts. SGL_ENABLE_JIT_DEEPGEMM=false is particularly impactful: it actively disables JIT DeepGEMM compilation, which on B300 (SM_100) suppresses a hardware-specific GEMM optimization path that all other B300 benchmark scripts rely on by default. The B200 script disables JIT DeepGEMM due to SM_90 stability concerns, which do not apply to SM_100. Precedent is clear: qwen3.5_fp8_b300.sh (PR #1048) deliberately stripped both vars when adapted from its B200 counterpart.

Why the "reuse B200 recipe as-is" framing does not excuse these omissions

One verifier argues these carry-overs are intentional because the PR description says it reuses the B200 recipe "as-is". However, this argument applies differently to the two issues. For Bug 1 (--enable-symm-mem): this is not a B200 setting being carried over; it is a missing B300 setting. Choosing to "reuse the B200 recipe" does not explain away the absence of a B300 hardware-specific flag — it just means the B200 recipe never had it. For Bug 2 (env vars): even granting the intent to use the B200 recipe verbatim, SGL_ENABLE_JIT_DEEPGEMM=false is not a neutral setting — it is an active suppression of an optimization that is otherwise the default on B300. The qwen3.5 B300 adaptation demonstrates the established pattern: when porting B200→B300, these two env vars are deliberately stripped. The PR note says "recipe as-is" but the recipe comparison shows the correct B300 pattern explicitly excludes these vars.

Step-by-step proof

  1. All three existing B300 SGLang single-node scripts include --enable-symm-mem in their server launch commands. dsr1_fp8_b300.sh does not.
  2. All three existing B300 SGLang single-node scripts omit SGL_ENABLE_JIT_DEEPGEMM and SGLANG_ENABLE_FLASHINFER_GEMM. dsr1_fp8_b300.sh (this PR) sets both.
  3. qwen3.5_fp8_b200.sh (B200) sets both env vars; qwen3.5_fp8_b300.sh (B300, PR #1048: Add B300 config: qwen3.5-fp8-sglang (non-MTP)) deliberately removed them — demonstrating the intended B300 adaptation pattern.
  4. A B300 benchmark run with this script will: (a) use NCCL allreduce instead of NVLink5 symmetric memory (suboptimal TP communication), and (b) disable JIT DeepGEMM (suppresses a GEMM optimization path active in all other B300 runs).
  5. Both issues produce artificially low throughput numbers, making the B300 benchmark results not representative of B300 hardware capability.

Fix

Remove lines 27-28 (SGL_ENABLE_JIT_DEEPGEMM=false and SGLANG_ENABLE_FLASHINFER_GEMM=true) and add --enable-symm-mem to the sglang.launch_server invocation, matching the pattern of dsr1_fp4_b300.sh, qwen3.5_fp8_b300.sh, and qwen3.5_fp8_b300_mtp.sh.
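Concretely, the adapted launch would look roughly like the sketch below. Only --model-path and --enable-symm-mem are taken from this thread; the remaining server arguments are elided because they are not quoted here, and this is a sketch of the pattern rather than the exact command:

```shell
# Sketch only: the B200-only env vars (SGL_ENABLE_JIT_DEEPGEMM,
# SGLANG_ENABLE_FLASHINFER_GEMM) are no longer exported above this point.
python3 -m sglang.launch_server \
  --model-path="$MODEL" \
  --enable-symm-mem
  # ...remaining arguments unchanged from the B200 recipe
```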

@functionstackx functionstackx merged commit 5b302f1 into main Apr 17, 2026
52 checks passed
@functionstackx functionstackx deleted the claude/add-dsr1-fp8-b300-sglang branch April 17, 2026 09:57
