fix sgl b200/b300 dpsk-v4 script #1158
Conversation
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b200-sglang
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
@Fridge003 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24943006152
@Fridge003 Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24943008546
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b200-sglang
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
@Qiaolin-Yu Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24945687144
@Qiaolin-Yu Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24945689122
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
@Qiaolin-Yu Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24947777157
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
@Qiaolin-Yu Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24948424840
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
@Qiaolin-Yu Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24948791360
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys dsv4-fp4-b300-sglang
@Qiaolin-Yu Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24953046586
* sglang dsv4 mtp
* knob-driven recipe selection
* self-contained mtp config; recipe via dp-attn
* add mtp_1 (1/1/2) variant
* knob-driven recipe selection
* pin sglang image to mega_moe-capable digest
* drop mtp_1 knob; align with PR #1158 image digest
* update nvidia-master.yaml
* fix: restore trailing newline in perf-changelog.yaml
* fix: remove --use-chat-template and floor --max-running-requests at 8

  The tokenizer for DSv4-Pro has no chat_template set, so --use-chat-template causes benchmark_serving.py to crash with ValueError. Remove it to align with dsv4_fp4_b300_sglang.sh. Also add a floor of 8 to --max-running-requests to match the base script and avoid too-low values at low concurrency.

* perf-changelog: add dsv4-fp4-b300-sglang-mtp entry

  Rebase perf-changelog.yaml on latest main (preserving #1173 and #1174 entries) and append the MTP config entry for PR #1166.

* dsv4-b300-sglang-mtp: tune EAGLE spec params from (3,1,4) to (4,1,5)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "dsv4-b300-sglang-mtp: tune EAGLE spec params from (3,1,4) to (4,1,5)"

  This reverts the EAGLE spec params back to (3, 1, 4):

    --speculative-num-steps 3
    --speculative-eagle-topk 1
    --speculative-num-draft-tokens 4

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Bryan Shan <58582368+Oseltamivir@users.noreply.github.com>
Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
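The floor on --max-running-requests described in the fix above can be illustrated with a small sketch. This is not the actual script: the helper name floor_max_running_requests is hypothetical, and it assumes the requested value is derived from the benchmark concurrency (CONC), which the commit message implies but does not state.

```shell
# Hypothetical helper (not from the repository): clamp a requested
# --max-running-requests value to a minimum of 8.
# Assumption: the requested value comes from the benchmark concurrency.
floor_max_running_requests() {
  requested="$1"
  floor=8
  if [ "$requested" -lt "$floor" ]; then
    echo "$floor"
  else
    echo "$requested"
  fi
}

# Show the effective value for a few concurrency levels.
for conc in 1 4 8 32; do
  echo "CONC=${conc} -> --max-running-requests $(floor_max_running_requests "$conc")"
done
```

Under this assumption, low-concurrency runs (1 or 4) would still launch with a value of 8, while higher concurrencies pass through unchanged.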
* dsv4-fp4-b300-sglang: revert to #1143 low-latency-only baseline

  Reverts the matrix expansion (#1132), script edits (#1158, #1173, #1174), and changelog retriggers (#1178) on top of the original #1143 entry. Restores the script and config block to their #1143 state and clears all prior dsv4-fp4-b300-sglang changelog entries to start fresh. The dsv4-fp4-b300-sglang-mtp config (#1166) is untouched.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf-changelog: add pr-link for #1184

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf-changelog: keep only the original #1143 entry, drop new entry

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the b300 revert in #1184.

Restores benchmarks/single_node/dsv4_fp4_b200.sh and the dsv4-fp4-b200-sglang block in nvidia-master.yaml to their pre-#1158 state (= post-#1131 baseline): un-pins the image digest and restores conc-start=4 in the low-latency rows.

No perf-changelog edit needed; #1158 did not add a b200 changelog entry.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…1158)

Mirror of #1185 for the b200 side. Re-applies the b200-specific changes from #1158 on top of the #1186 baseline.

- Image pinned to lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4...
- Adds DP_ATTENTION env knob and SGLANG_OPT_* perf env vars
- Search space gets conc-start=1 in low-latency rows (was 4)
- Recipe-per-CONC dispatch in script: low-latency / balanced / max-throughput selected by DP_ATTENTION + CONC

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…C split + DP-attn SWA tweak (#1185)

* dsv4-fp4-b300-sglang: recipe-per-CONC split + DP-attn SWA tweak

  Squashes the cumulative changes from #1158 and #1174 into a single commit on top of the #1184 baseline. Excludes the iterative --max-running-requests floor from #1173.

  - Image pinned to lmsysorg/sglang:deepseek-v4-b300@sha256:26e116bd...
  - Search space: TP8/EP1 conc=1, TP4/EP1 conc=32, TP4/EP4 dp-attn conc=512 for both 1k1k and 8k1k
  - Script dispatches on DP_ATTENTION knob: TP-only (flashinfer_mxfp4) vs DP-attn (deepep + prefill-delayer + mega_moe env vars)
  - DP-attn path enables SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN=1

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf-changelog: add pr-link for #1185

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Apply suggestion from @Qiaolin-Yu

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
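To make the b300 search space from #1185 concrete, the following sketch enumerates the three rows for both sequence-length settings. The function name and the "parallelism:dp-attn:conc" encoding are hypothetical, and marking the two TP-only rows as dp-attn=false is an assumption drawn from the commit message, not a copy of nvidia-master.yaml.

```shell
# Hypothetical enumeration of the search space described in #1185.
# Each row is encoded as "parallelism:dp-attn:conc" (encoding is illustrative).
search_space_rows() {
  for seq in 1k1k 8k1k; do
    for row in "TP8/EP1:false:1" "TP4/EP1:false:32" "TP4/EP4:true:512"; do
      par=${row%%:*}        # text before the first colon
      rest=${row#*:}        # text after the first colon
      dp_attn=${rest%%:*}   # middle field
      conc=${rest#*:}       # last field
      echo "$seq $par dp-attn=$dp_attn conc=$conc"
    done
  done
}

search_space_rows
```

This yields six points in total: three rows for each of 1k1k and 8k1k, with only the TP4/EP4 row taking the DP-attn path.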
…C split (#1187)

* dsv4-fp4-b200-sglang: recipe-per-CONC dispatch (re-apply b200 part of #1158)

  Mirror of #1185 for the b200 side. Re-applies the b200-specific changes from #1158 on top of the #1186 baseline.

  - Image pinned to lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4...
  - Adds DP_ATTENTION env knob and SGLANG_OPT_* perf env vars
  - Search space gets conc-start=1 in low-latency rows (was 4)
  - Recipe-per-CONC dispatch in script: low-latency / balanced / max-throughput selected by DP_ATTENTION + CONC

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf-changelog: add pr-link for #1187

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* dsv4-fp4-b200-sglang: restore --disable-radix-cache flag

  The flag was accidentally dropped during the recipe-per-CONC rewrite. Restoring it to match the baseline methodology (prefix caching disabled) and stay consistent with all other dsv4 sister scripts.

  Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update changelog and YAML comments to match two-way DP_ATTENTION dispatch

  The script has two branches (DP_ATTENTION true/false), not three CONC-keyed recipes. Both balanced and max-throughput rows use the same DP-attention + DeepEP flags; only --max-running-requests differs. Updated the nvidia-master.yaml comment block and perf-changelog description to accurately reflect this two-recipe dispatch.

  Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
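The two-way dispatch described in the last fix can be sketched roughly as below. This is an illustration under stated assumptions, not the contents of the b200 script: build_args is a hypothetical name, the branch= labels are placeholders standing in for the real per-branch flags, and passing CONC straight through to --max-running-requests is an assumption. Only --disable-radix-cache, --max-running-requests and the DP_ATTENTION knob come from the commit messages.

```shell
# Hypothetical sketch of a two-branch dispatch keyed on DP_ATTENTION.
# "branch=..." is a placeholder label, not a real server flag.
build_args() {
  dp_attention="$1"
  conc="$2"
  # Shared by both branches: prefix caching stays disabled, and the
  # request cap is assumed to follow the concurrency.
  args="--disable-radix-cache --max-running-requests $conc"
  if [ "$dp_attention" = "true" ]; then
    echo "branch=dp-attn $args"
  else
    echo "branch=tp-only $args"
  fi
}

build_args false 1     # low-latency row
build_args true 64     # a DP-attention row at moderate concurrency
build_args true 512    # a DP-attention row at high concurrency
```

The two DP-attention calls produce the same line apart from the number after --max-running-requests, which is the point the fix makes about balanced and max-throughput rows sharing one recipe.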
No description provided.