[NVIDIA] chore: B200 single node DeepSeek v4 SGLang #1131
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Adds the DeepSeek-V4-Flash B200 SGLang recipe from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4. Prefix caching and speculative decoding are disabled for baseline numbers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Uses deepseek-ai/DeepSeek-V4-Pro with tp=8, ep=8, dp-attention enabled and sweep concurrency ranges aligned with dsv4-fp4-b200-vllm (4-1024 at 1k/1k, 4-512 at 8k/1k). Script now passes --enable-dp-attention when DP_ATTENTION=true and sets --mem-fraction-static per the Pro recipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Server launch now mirrors the DeepSeek-V4-Pro command from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4: --tp N, --moe-runner-backend flashinfer_mxfp4, --mem-fraction-static 0.82, SGLANG_JIT_DEEPGEMM_PRECOMPILE=0. Speculative decoding omitted and --disable-radix-cache added per the no-spec / no-prefix-cache baseline. YAML search-space drops ep/dp-attn to tp=8, ep=1. Also syncs runners/launch_b200-dgxc-slurm.sh with the HF cache mount path from origin/claude/add-dsv4-fp4-b200-vllm so both PRs stay in agreement on runner layout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
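For orientation, here is a minimal sketch of what the launch looks like when the flags from these two commits are combined. The flag values come straight from the commit messages; the variable names (`TP`, `DP_ATTENTION`) are illustrative, not necessarily the script's actual ones.

```bash
# Sketch only: combines the DeepSeek-V4-Pro flags above.
# TP and DP_ATTENTION are illustrative variable names.
export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0

extra_args=()
if [ "${DP_ATTENTION:-false}" = "true" ]; then
  extra_args+=(--enable-dp-attention)
fi

python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V4-Pro \
  --tp "${TP:-8}" \
  --moe-runner-backend flashinfer_mxfp4 \
  --mem-fraction-static 0.82 \
  --disable-radix-cache \
  "${extra_args[@]}"
```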
The deepseek-v4-blackwell image doesn't expose sglang via system python3, so the module import fails: /usr/bin/python3: Error while finding module specification for 'sglang.launch_server' (ModuleNotFoundError: No module named 'sglang') Switch to the `sglang serve` entrypoint that the cookbook uses; the CLI resolves the correct interpreter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lmsysorg/sglang:deepseek-v4-blackwell image installs sglang editable at /workspace/sglang/python — unlike every prior sglang tag which uses /sgl-workspace/sglang. Our $GITHUB_WORKSPACE:/workspace/ bind-mount masks that directory, breaking `import sglang`. Conditionally mount at /ix for this image only and make the dsv4 benchmark script use $PWD for server/metrics/result paths so it works regardless of the mount target. All other configs still mount at /workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
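A sketch of that conditional mount, assuming the runner has the image name in a variable (`IMAGE` and `mount_arg` are illustrative names):

```bash
# Sketch: mount elsewhere for this one image so the bind-mount doesn't
# mask the editable install at /workspace/sglang/python.
if [[ "$IMAGE" == *deepseek-v4-blackwell* ]]; then
  mount_arg="$GITHUB_WORKSPACE:/ix"
else
  mount_arg="$GITHUB_WORKSPACE:/workspace/"
fi
```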
The lmsysorg/sglang:deepseek-v4-blackwell image installs sglang editable at /workspace/sglang/python, which our $GITHUB_WORKSPACE:/workspace/ bind-mount masks. Temporary one-line workaround: pip install --no-deps sglang in the benchmark script to restore a non-editable copy in site-packages. Runner reverted to the standard /workspace mount. Marked with a TODO(Cam) for the proper fix once lmsys publishes an image that doesn't editable-install under /workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
'pip install --no-deps sglang' is a no-op when sglang is already registered in site-packages -- even if the underlying editable path is missing -- so the prior workaround never actually swapped in a working install. Uninstall the broken egg-link first, then reinstall. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
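A sketch of the two-step fix as described:

```bash
# Remove the stale editable registration first; only then does a fresh
# install actually land a non-editable copy in site-packages.
pip uninstall -y sglang
pip install --no-deps sglang
```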
Back to the proper mount fix so we use the same 'PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...' invocation as every other sglang single_node script. Conditional mount target keeps the blast radius to this one config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The image ENV pins CUDA_VISIBLE_DEVICES=4,5,6,7 (leftover from lmsys's internal testing). With --no-container-entrypoint it isn't cleared, so the container only sees 4 GPUs and TP=8 fails with torch.AcceleratorError: CUDA error: invalid device ordinal. Unset it at the top of the script so Slurm's 8-GPU allocation is visible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
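The guard is a one-liner at the top of the script:

```bash
# The image's ENV pins CUDA_VISIBLE_DEVICES=4,5,6,7; clear it so the
# container sees the full 8-GPU Slurm allocation.
unset CUDA_VISIBLE_DEVICES
```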
Only patched launch_b200-dgxc-slurm.sh last time; the b200-nb runner still had the default $GITHUB_WORKSPACE:/workspace/ mount, which masks the deepseek-v4-blackwell image's /workspace/sglang editable install. Most B200 jobs in this repo run on b200-nb. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
force-pushed the branch from c681a6d to fe012a7
- Correct suffix from _h200 to _b200 (copy-paste from launch_h200-cw.sh would have routed b200 jobs to non-existent h200 scripts).
- Apply the same /ix mount conditional for deepseek-v4-blackwell as the other b200 runners, so sglang's editable install at /workspace/sglang/python isn't masked.
- Add b200-cw_00 / b200-cw_01 to the b200 runner pool in runners.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… rmdir

- SQUASH_FILE lives under /tmp/gharunner/squash on the allocated worker node and isn't visible from the host, so realpath on the host returned empty and srun failed with 'Invalid --container-image argument: '. Pass the path straight through; srun resolves it inside the job (sketched below).
- Remove the leftover 'rmdir $SAGEMAKER_SHM_PATH' — the env var isn't set in this cluster and rmdir fired with no operand every run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
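A before/after sketch of the --container-image argument; the command tail is illustrative:

```bash
# Before (broken): realpath ran on the submit host, where
# /tmp/gharunner/squash doesn't exist, so srun got an empty argument.
#   srun --container-image "$(realpath "$SQUASH_FILE")" bash benchmark.sh
# After: pass the path through unresolved; srun resolves it on the
# allocated worker node.
srun --container-image "$SQUASH_FILE" bash benchmark.sh
```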
…) by CONC
The cookbook documents three B200 recipes for DeepSeek-V4-Pro that
differ significantly in server flags. Pick between them based on CONC:
- CONC <= 32 -> low-latency (TP only, chunked-prefill 4096, disable-flashinfer-autotune)
- CONC 33..128 -> balanced (+ DP-attention, max-running-reqs=128, cuda-graph-max-bs=64, deepep-config)
- CONC > 128 -> max-throughput (+ DP-attention, max-running-reqs=256, cuda-graph-max-bs=64, deepep-config)
Speculative decoding still omitted from all three per the no-spec
baseline, and --disable-radix-cache kept for no-prefix-caching.
Thresholds mirror the recipes' own max-running-requests caps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
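A sketch of that selection, assuming `CONC` holds the sweep concurrency; the flag sets are lifted from the tiers above (the deepep config flags are elided):

```bash
# Pick the recipe tier by concurrency; thresholds mirror the recipes'
# own max-running-requests caps. Sketch only.
if [ "$CONC" -le 32 ]; then
  # low-latency: TP only
  tier_args=(--chunked-prefill-size 4096 --disable-flashinfer-autotune)
elif [ "$CONC" -le 128 ]; then
  # balanced
  tier_args=(--enable-dp-attention --max-running-requests 128 --cuda-graph-max-bs 64)
else
  # max-throughput
  tier_args=(--enable-dp-attention --max-running-requests 256 --cuda-graph-max-bs 64)
fi
```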
Split the single CONC 4..1024/512 row into three rows (low-latency / balanced / max-throughput) matching the recipe boundaries inside dsv4_fp4_b200.sh so result filenames carry accurate ep= and dpa= labels. ep=8 on balanced/max-throughput reflects sglang's implicit ep_size=tp_size override when --moe-a2a-backend deepep is set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror of #1146 for B200. Each model historically used one inference engine, so the b200 launchers just resolved benchmarks/single_node/${model}_${precision}_b200.sh regardless of FRAMEWORK. With dsv4 we now want both an sglang script (already on main as dsv4_fp4_b200.sh from #1131) and a vllm script (added by this PR as dsv4_fp4_b200_vllm.sh) to coexist.

- launch_b200-{nb,dgxc-slurm,cw}.sh prefer an engine-tagged script (e.g. dsv4_fp4_b200_vllm.sh) and fall back to the legacy unsuffixed name (or the existing _trt suffix) when the tagged variant is absent; see the sketch after this message. Existing dsr1/glm5/qwen3.5/kimik2.5/minimaxm2.5/gptoss/dsv4-sglang b200 scripts keep their current names.
- This wires up the dsv4-fp4-b200-vllm config so FRAMEWORK=vllm resolves to dsv4_fp4_b200_vllm.sh instead of the sglang script that shares the unsuffixed path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
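A sketch of the fallback resolution, using the variable names the commit message itself cites (the _trt branch is elided):

```bash
# Prefer the engine-tagged script; fall back to the legacy unsuffixed
# name when the tagged variant is absent. Sketch only.
script="benchmarks/single_node/${model}_${precision}_b200_${FRAMEWORK}.sh"
if [ ! -f "$script" ]; then
  script="benchmarks/single_node/${model}_${precision}_b200.sh"
fi
```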
Mirrors the b300 revert in #1184. Restores benchmarks/single_node/ dsv4_fp4_b200.sh and the dsv4-fp4-b200-sglang block in nvidia-master.yaml to their pre-#1158 state (= post-#1131 baseline) — un-pins the image digest and restores conc-start=4 in the low-latency rows. No perf-changelog edit needed; #1158 did not add a b200 changelog entry. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Add dsv4-fp4-b200-sglang to .github/configs/nvidia-master.yaml using the lmsysorg/sglang:deepseek-v4-blackwell image and the deepseek-ai/DeepSeek-V4-Flash model
- Add benchmarks/single_node/dsv4_fp4_b200.sh following the DeepSeek-V4 B200 SGLang recipe, with prefix caching (--disable-radix-cache) and speculative decoding both disabled for baseline numbers
- Add a perf-changelog.yaml entry

Test plan

- A sweep run with the sweep-enabled label produces results for 1k/1k and 8k/1k ISL/OSL at tp=4 ep=4

🤖 Generated with Claude Code