dsv4-fp4-b200-sglang: revert b200 portion of #1158 #1186
Mirrors the b300 revert in #1184. Restores benchmarks/single_node/dsv4_fp4_b200.sh and the dsv4-fp4-b200-sglang block in nvidia-master.yaml to their pre-#1158 state (= post-#1131 baseline): un-pins the image digest and restores conc-start=4 in the low-latency rows. No perf-changelog edit needed; #1158 did not add a b200 changelog entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
…1158)

Mirror of #1185 for the b200 side. Re-applies the b200-specific changes from #1158 on top of the #1186 baseline.

- Image pinned to lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4...
- Adds DP_ATTENTION env knob and SGLANG_OPT_* perf env vars
- Search space gets conc-start=1 in low-latency rows (was 4)
- Recipe-per-CONC dispatch in script: low-latency / balanced / max-throughput selected by DP_ATTENTION + CONC

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mechanical benchmark-config revert mirroring #1184; restores the b200 script and nvidia-master.yaml entries to their post-#1131 baseline.
Overview
This PR mirrors the already-merged #1184 for the b200 side. It reverts the b200-specific changes from #1158 by restoring benchmarks/single_node/dsv4_fp4_b200.sh to a CONC-based recipe dispatch (low-latency / balanced / max-throughput) and updating two entries in .github/configs/nvidia-master.yaml (un-pin the sglang image digest and restore conc-start: 4 in the low-latency rows).
Security risks
None — the changes are confined to a CI benchmark shell script and a YAML config file used only for benchmark sweeps. No auth/crypto/permissions code is touched.
Level of scrutiny
Low. This is benchmark/CI sweep infrastructure, not production runtime code. The change is a straightforward revert to a previously-known-good state, parallel in shape to #1184 which has already landed.
Other factors
The bug-hunting system found no bugs. The only oddity is the un-pinning of the sglang image digest, which gives up reproducibility — but that matches the pre-#1158 baseline behavior the PR explicitly aims to restore, and the b300 side made the same choice in #1184.
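The reproducibility trade-off mentioned here comes down to whether the image reference carries an `@sha256:` digest. As a rough illustration only, a check along these lines could distinguish the two forms; `is_pinned` is a hypothetical helper name and is not part of the PR or of the benchmark scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): succeeds when an image reference
# is pinned to a content digest, fails when it is a floating tag.
is_pinned() {
  case "$1" in
    *@sha256:*) return 0 ;;  # digest present: pulls resolve to one fixed image
    *)          return 1 ;;  # tag only: the tag may move between runs
  esac
}
```

With this sketch, `is_pinned "lmsysorg/sglang:deepseek-v4-blackwell"` fails, which corresponds to the un-pinned baseline this PR restores.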
…C split (#1187)

* dsv4-fp4-b200-sglang: recipe-per-CONC dispatch (re-apply b200 part of #1158)

Mirror of #1185 for the b200 side. Re-applies the b200-specific changes from #1158 on top of the #1186 baseline.

- Image pinned to lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4...
- Adds DP_ATTENTION env knob and SGLANG_OPT_* perf env vars
- Search space gets conc-start=1 in low-latency rows (was 4)
- Recipe-per-CONC dispatch in script: low-latency / balanced / max-throughput selected by DP_ATTENTION + CONC

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf-changelog: add pr-link for #1187

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* dsv4-fp4-b200-sglang: restore --disable-radix-cache flag

The flag was accidentally dropped during the recipe-per-CONC rewrite. Restoring it to match the baseline methodology (prefix caching disabled) and stay consistent with all other dsv4 sister scripts.

Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update changelog and YAML comments to match two-way DP_ATTENTION dispatch

The script has two branches (DP_ATTENTION true/false), not three CONC-keyed recipes. Both balanced and max-throughput rows use the same DP-attention + DeepEP flags; only --max-running-requests differs. Updated the nvidia-master.yaml comment block and perf-changelog description to accurately reflect this two-recipe dispatch.

Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
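The two-way dispatch described in the last commit can be pictured roughly as follows. This is a minimal sketch, not the script's actual code: the function name `recipe_for_dp_attention` and the recipe labels are hypothetical, and treating the `DP_ATTENTION=false` branch as the low-latency recipe is an assumption inferred from the commit text.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a two-branch dispatch keyed on DP_ATTENTION.
# The labels are illustrative; the real script sets server flags instead.
recipe_for_dp_attention() {
  # $1: value of DP_ATTENTION; anything other than "true" counts as false
  if [ "${1:-false}" = "true" ]; then
    # Branch shared by the balanced and max-throughput rows, which per the
    # commit message differ only in --max-running-requests.
    echo "dp-attention"
  else
    # Assumed to correspond to the low-latency rows.
    echo "low-latency"
  fi
}
```

For example, `recipe_for_dp_attention "$DP_ATTENTION"` would select a branch from the environment knob named in the commit.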
Summary
Mirror of #1184 for the b200 side. Reverts the b200-specific changes from #1158 to their pre-#1158 baseline (= post-#1131 state).
What's reverted
- benchmarks/single_node/dsv4_fp4_b200.sh: restored to its post-#1131 ([NVIDIA] chore: B200 single node DeepSeek v4 SGLang) form. Drops the DP_ATTENTION env knob, the SGLANG_OPT_* env block, and the dual PARALLEL_ARGS branches; restores the original CONC-based recipe dispatch (low-latency / balanced / max-throughput selected by CONC inside the script).
- dsv4-fp4-b200-sglang block in nvidia-master.yaml: un-pins image: lmsysorg/sglang:deepseek-v4-blackwell (drops the @sha256:df18bfc4... digest), and restores conc-start: 4 in the low-latency rows for both 1k1k and 8k1k (was conc-start: 1).
Not touched
- dsv4-fp4-b300-sglang (already handled by #1184, "dsv4-fp4-b300-sglang: revert to #1143 low-latency-only baseline", and #1185, "[co-authored with sglang community maintainers leads at radixark] [NVIDIA][SGLang][redo PR] B300 DeepSeek v4 FP4 SGLang: recipe-per-CONC split + DP-attn SWA tweak").
- perf-changelog.yaml: #1158 ("fix sgl b200/b300 dpsk-v4 script") did not add a b200 changelog entry; the existing #1131 ("[NVIDIA] chore: B200 single node DeepSeek v4 SGLang") entry stays as-is.
Test plan
- … tp=8 ep=1 conc 4-32, balanced tp=8 ep=8 dp-attn conc 64-128, max-throughput tp=8 ep=8 dp-attn conc 256-{512,1024}).
🤖 Generated with Claude Code
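The CONC-based dispatch that this revert restores might look something like the sketch below. It is illustrative only: the function name `recipe_for_conc` is hypothetical, and the cut points (32 and 128) are assumptions taken from the concurrency ranges in the test plan, not from the script itself, which may draw the boundaries differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a three-way dispatch keyed on CONC.
# Thresholds are assumed from the test-plan ranges (4-32, 64-128, 256 and up).
recipe_for_conc() {
  # $1: benchmark concurrency (CONC), a positive integer
  if [ "$1" -le 32 ]; then
    echo "low-latency"
  elif [ "$1" -le 128 ]; then
    echo "balanced"
  else
    echo "max-throughput"
  fi
}
```

Under these assumed thresholds, `recipe_for_conc 64` prints `balanced`.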