dsv4-fp4 sglang b300: floor --max-running-requests at 8 #1173
Merged
Qiaolin-Yu merged 3 commits into SemiAnalysisAI:main on Apr 26, 2026
Mirrors the floor-of-4 pattern from the mi355x atom script (SemiAnalysisAI#1170); prevents tiny CONC values from yielding sub-optimal max-running-requests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ts floor) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Qiaolin-Yu approved these changes on Apr 26, 2026
yhyang201 added a commit to Qiaolin-Yu/InferenceX that referenced this pull request on Apr 26, 2026
Rebase perf-changelog.yaml on latest main (preserving SemiAnalysisAI#1173 and SemiAnalysisAI#1174 entries) and append the MTP config entry for PR SemiAnalysisAI#1166.
Qiaolin-Yu pushed a commit that referenced this pull request on Apr 26, 2026
* sglang dsv4 mtp
* knob-driven recipe selection
* self-contained mtp config; recipe via dp-attn
* add mtp_1 (1/1/2) variant
* knob-driven recipe selection
* pin sglang image to mega_moe-capable digest
* drop mtp_1 knob; align with PR #1158 image digest
* update nvidia-master.yaml
* fix: restore trailing newline in perf-changelog.yaml
* fix: remove --use-chat-template and floor --max-running-requests at 8
  The tokenizer for DSv4-Pro has no chat_template set, so --use-chat-template causes benchmark_serving.py to crash with ValueError. Remove it to align with dsv4_fp4_b300_sglang.sh. Also add a floor of 8 to --max-running-requests to match the base script and avoid too-low values at low concurrency.
* perf-changelog: add dsv4-fp4-b300-sglang-mtp entry
  Rebase perf-changelog.yaml on latest main (preserving #1173 and #1174 entries) and append the MTP config entry for PR #1166.
* dsv4-b300-sglang-mtp: tune EAGLE spec params from (3,1,4) to (4,1,5)
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Revert "dsv4-b300-sglang-mtp: tune EAGLE spec params from (3,1,4) to (4,1,5)"
  This reverts the EAGLE spec params back to (3, 1, 4): --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Bryan Shan <58582368+Oseltamivir@users.noreply.github.com>
Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
cquil11 added a commit that referenced this pull request on Apr 26, 2026
* dsv4-fp4-b300-sglang: revert to #1143 low-latency-only baseline
  Reverts the matrix expansion (#1132), script edits (#1158, #1173, #1174), and changelog retriggers (#1178) on top of the original #1143 entry. Restores the script and config block to their #1143 state and clears all prior dsv4-fp4-b300-sglang changelog entries to start fresh. The dsv4-fp4-b300-sglang-mtp config (#1166) is untouched.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf-changelog: add pr-link for #1184
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf-changelog: keep only the original #1143 entry, drop new entry
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cquil11 added a commit that referenced this pull request on Apr 26, 2026
…C split + DP-attn SWA tweak (#1185)
* dsv4-fp4-b300-sglang: recipe-per-CONC split + DP-attn SWA tweak
  Squashes the cumulative changes from #1158 and #1174 into a single commit on top of the #1184 baseline. Excludes the iterative --max-running-requests floor from #1173.
  - Image pinned to lmsysorg/sglang:deepseek-v4-b300@sha256:26e116bd...
  - Search space: TP8/EP1 conc=1, TP4/EP1 conc=32, TP4/EP4 dp-attn conc=512 for both 1k1k and 8k1k
  - Script dispatches on DP_ATTENTION knob: TP-only (flashinfer_mxfp4) vs DP-attn (deepep + prefill-delayer + mega_moe env vars)
  - DP-attn path enables SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN=1
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf-changelog: add pr-link for #1185
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Apply suggestion from @Qiaolin-Yu
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Summary
- Floor --max-running-requests at 8 in dsv4_fp4_b300_sglang.sh so small CONC values don't yield sub-optimal queue depth.
- Larger CONC values keep the existing CONC * 3 / 2 sizing.

Test plan
- With a small CONC, --max-running-requests is launched as 8.
- With a larger CONC, --max-running-requests is still CONC * 3 / 2.

🤖 Generated with Claude Code
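The sizing rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of dsv4_fp4_b300_sglang.sh: the function name max_running_requests and the FLOOR variable are hypothetical, and the sketch assumes the floor is applied to the result of the integer CONC * 3 / 2 computation.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the real script): size
# --max-running-requests as CONC * 3 / 2 with a lower bound of 8.
FLOOR=8

max_running_requests() {
    conc="$1"
    # Integer arithmetic: 3/2 of the requested concurrency, rounded down.
    sized=$(( conc * 3 / 2 ))
    # Assumption: the floor clamps the computed value, not CONC itself.
    if [ "$sized" -lt "$FLOOR" ]; then
        sized="$FLOOR"
    fi
    echo "$sized"
}

# Print the value that would be passed for a few concurrency levels.
for conc in 1 4 32 512; do
    echo "CONC=$conc -> --max-running-requests $(max_running_requests "$conc")"
done
```

Under this assumption the floor only affects runs where CONC * 3 / 2 would fall below 8; higher concurrency levels are sized exactly as before.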