dsv4-fp4-b200-sglang: revert b200 portion of #1158 by cquil11 · Pull Request #1186 · SemiAnalysisAI/InferenceX

cquil11 · 2026-04-26T20:08:24Z

Summary

Mirror of #1184 for the b200 side. Reverts the b200-specific changes from #1158 to their pre-#1158 baseline (= post-#1131 state).

What's reverted

benchmarks/single_node/dsv4_fp4_b200.sh: restored to its post-#1131 form (#1131 is "[NVIDIA] chore: B200 single node DeepSeek v4 SGLang"). Drops the DP_ATTENTION env knob, the SGLANG_OPT_* env block, and the dual PARALLEL_ARGS branches; restores the original CONC-based recipe dispatch (low-latency / balanced / max-throughput selected by CONC inside the script).
dsv4-fp4-b200-sglang block in nvidia-master.yaml: un-pins image: lmsysorg/sglang:deepseek-v4-blackwell (drops the @sha256:df18bfc4... digest), and restores conc-start: 4 in the low-latency rows for both 1k1k and 8k1k (was conc-start: 1).
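As a rough illustration of what "CONC-based recipe dispatch" means, here is a minimal Python sketch. The real dispatch lives in the shell script and is not reproduced in this PR page; the function name `select_recipe` is hypothetical, and the cutoffs (32 and 128) are assumptions inferred from the concurrency ranges listed in the test plan, not values read from the script.

```python
def select_recipe(conc: int) -> str:
    """Pick a recipe name from the concurrency level.

    The cutoffs are assumptions taken from the test-plan ranges
    (low-latency 4-32, balanced 64-128, max-throughput 256+).
    """
    if conc <= 32:
        return "low-latency"
    if conc <= 128:
        return "balanced"
    return "max-throughput"


if __name__ == "__main__":
    for conc in (4, 32, 64, 128, 256, 1024):
        print(conc, select_recipe(conc))
```

The point of this shape is that the caller only supplies CONC; no separate DP_ATTENTION knob is needed to choose a recipe.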

Not touched

dsv4-fp4-b300-sglang: already handled by #1184 ("dsv4-fp4-b300-sglang: revert to #1143 low-latency-only baseline") and #1185 ("[co-authored with sglang community maintainers leads at radixark] [NVIDIA][SGLang][redo PR] B300 DeepSeek v4 FP4 SGLang: recipe-per-CONC split + DP-attn SWA tweak").
perf-changelog.yaml: #1158 ("fix sgl b200/b300 dpsk-v4 script") did not add a b200 changelog entry; the existing #1131 entry ("[NVIDIA] chore: B200 single node DeepSeek v4 SGLang") stays as-is.

Test plan

Sweep run on B200 against the restored matrix (low-latency tp=8 ep=1 conc 4-32, balanced tp=8 ep=8 dp-attn conc 64-128, max-throughput tp=8 ep=8 dp-attn conc 256-{512,1024}).
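The restored matrix can be written out as data to make the three rows easier to compare. This is a sketch under stated assumptions: the `Row` type and both helper functions are hypothetical, concurrency is assumed to step by doubling (the stepping is not stated here), low-latency is assumed to run without DP attention, and the max-throughput upper bound is left as a parameter because this page does not say which of 512 or 1024 applies to which configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    recipe: str
    tp: int
    ep: int
    dp_attn: bool
    conc_start: int
    conc_end: int


def conc_points(row: Row) -> list:
    """Concurrency values for a row, assuming a doubling step."""
    points = []
    conc = row.conc_start
    while conc <= row.conc_end:
        points.append(conc)
        conc *= 2
    return points


def restored_matrix(max_throughput_end: int) -> list:
    """Rows from the test plan; max_throughput_end is 512 or 1024."""
    return [
        Row("low-latency", tp=8, ep=1, dp_attn=False, conc_start=4, conc_end=32),
        Row("balanced", tp=8, ep=8, dp_attn=True, conc_start=64, conc_end=128),
        Row("max-throughput", tp=8, ep=8, dp_attn=True,
            conc_start=256, conc_end=max_throughput_end),
    ]


if __name__ == "__main__":
    for row in restored_matrix(1024):
        print(row.recipe, conc_points(row))
```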

🤖 Generated with Claude Code

Mirrors the b300 revert in #1184. Restores benchmarks/single_node/dsv4_fp4_b200.sh and the dsv4-fp4-b200-sglang block in nvidia-master.yaml to their pre-#1158 state (= post-#1131 baseline): un-pins the image digest and restores conc-start=4 in the low-latency rows. No perf-changelog edit needed; #1158 did not add a b200 changelog entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-26T20:08:32Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


…1158)

Mirror of #1185 for the b200 side. Re-applies the b200-specific changes from #1158 on top of the #1186 baseline.

- Image pinned to lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4...
- Adds DP_ATTENTION env knob and SGLANG_OPT_* perf env vars
- Search space gets conc-start=1 in low-latency rows (was 4)
- Recipe-per-CONC dispatch in script: low-latency / balanced / max-throughput selected by DP_ATTENTION + CONC

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude

Mechanical benchmark-config revert mirroring #1184; restores the b200 script and nvidia-master.yaml entries to their post-#1131 baseline.

Overview

This PR mirrors the already-merged #1184 for the b200 side. It reverts the b200-specific changes from #1158 by restoring benchmarks/single_node/dsv4_fp4_b200.sh to a CONC-based recipe dispatch (low-latency / balanced / max-throughput) and updating two entries in .github/configs/nvidia-master.yaml (un-pin the sglang image digest and restore conc-start: 4 in the low-latency rows).

Security risks

None — the changes are confined to a CI benchmark shell script and a YAML config file used only for benchmark sweeps. No auth/crypto/permissions code is touched.

Level of scrutiny

Low. This is benchmark/CI sweep infrastructure, not production runtime code. The change is a straightforward revert to a previously-known-good state, parallel in shape to #1184 which has already landed.

Other factors

The bug-hunting system found no bugs. The only oddity is the un-pinning of the sglang image digest, which gives up reproducibility — but that matches the pre-#1158 baseline behavior the PR explicitly aims to restore, and the b300 side made the same choice in #1184.
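To make the reproducibility trade-off concrete: a tag-only reference can resolve to a different image over time, while a digest-pinned reference cannot. The sketch below checks which form a reference takes. The helper name `is_digest_pinned` is hypothetical, and the digest in the usage example is a dummy value of 64 zeros, not the real digest, which is truncated on this page.

```python
import re

# A full sha256 digest suffix is "@sha256:" followed by 64 lowercase hex characters.
_DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference ends with a full sha256 digest."""
    return bool(_DIGEST_SUFFIX.search(image_ref))


if __name__ == "__main__":
    tag_only = "lmsysorg/sglang:deepseek-v4-blackwell"
    pinned = tag_only + "@sha256:" + "0" * 64  # dummy digest for illustration
    print(is_digest_pinned(tag_only))
    print(is_digest_pinned(pinned))
```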

…C split (#1187)

* dsv4-fp4-b200-sglang: recipe-per-CONC dispatch (re-apply b200 part of #1158)

Mirror of #1185 for the b200 side. Re-applies the b200-specific changes from #1158 on top of the #1186 baseline.

- Image pinned to lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4...
- Adds DP_ATTENTION env knob and SGLANG_OPT_* perf env vars
- Search space gets conc-start=1 in low-latency rows (was 4)
- Recipe-per-CONC dispatch in script: low-latency / balanced / max-throughput selected by DP_ATTENTION + CONC

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf-changelog: add pr-link for #1187

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* dsv4-fp4-b200-sglang: restore --disable-radix-cache flag

The flag was accidentally dropped during the recipe-per-CONC rewrite. Restoring it to match the baseline methodology (prefix caching disabled) and stay consistent with all other dsv4 sister scripts.

Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update changelog and YAML comments to match two-way DP_ATTENTION dispatch

The script has two branches (DP_ATTENTION true/false), not three CONC-keyed recipes. Both balanced and max-throughput rows use the same DP-attention + DeepEP flags; only --max-running-requests differs. Updated the nvidia-master.yaml comment block and perf-changelog description to accurately reflect this two-recipe dispatch.

Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
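The two-way dispatch described in the last commit of #1187 can be sketched as follows. This is an illustration only, not the script's code: the function name `parallel_args` is hypothetical, the angle-bracketed strings are placeholders standing in for flag groups that this page does not spell out, and passing --max-running-requests only on the DP-attention branch is an assumption. Only --disable-radix-cache and --max-running-requests are named in the commit messages.

```python
def parallel_args(dp_attention: bool, max_running_requests: int) -> list:
    """Build an illustrative argument list for one of the two branches."""
    # Prefix caching stays disabled in both branches, per the commit message.
    args = ["--disable-radix-cache"]
    if dp_attention:
        # Balanced and max-throughput rows share this branch; only the
        # --max-running-requests value differs between them.
        args.append("<DP-attention + DeepEP flags>")  # placeholder, not a real flag
        args += ["--max-running-requests", str(max_running_requests)]
    else:
        args.append("<low-latency flags>")  # placeholder, not a real flag
    return args


if __name__ == "__main__":
    print(parallel_args(False, 32))
    print(parallel_args(True, 128))
    print(parallel_args(True, 1024))
```

Calling it with two different limits on the DP-attention branch yields lists that differ only in the final value, which is the behaviour the commit message describes for the balanced and max-throughput rows.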

cquil11 requested a review from a team April 26, 2026 20:08

cquil11 requested review from jgangani and kedarpotdar-nv as code owners April 26, 2026 20:08

github-project-automation Bot added this to InferenceMAX Board Apr 26, 2026

cquil11 merged commit 1d0a9f0 into main Apr 26, 2026
8 checks passed

cquil11 deleted the chore/revert-dsv4-fp4-b200-sglang-from-1158 branch April 26, 2026 20:10

github-project-automation Bot moved this to Done in InferenceMAX Board Apr 26, 2026

cquil11 mentioned this pull request Apr 26, 2026

[NVIDIA][SGLang][redo PR] B200 DeepSeek v4 FP4 SGLang: recipe-per-CONC split #1187

Merged

2 tasks

claude Bot reviewed Apr 26, 2026

cquil11 merged 1 commit into main from chore/revert-dsv4-fp4-b200-sglang-from-1158
