dsv4-fp4 sglang b300: floor --max-running-requests at 8 by sglang-bot · Pull Request #1173 · SemiAnalysisAI/InferenceX

sglang-bot · 2026-04-26T10:33:21Z

Summary

Floor --max-running-requests at 8 in dsv4_fp4_b300_sglang.sh so that small CONC values no longer size the running-request limit below 8.
Mirrors the floor pattern from the mi355x atom script (#1170, "dsv4-fp4-mi355x-atom: size --max-num-seqs to CONC with floor of 4"), adapted for the sglang script's CONC * 3 / 2 sizing.
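
The sizing rule described above can be sketched as follows. This is a minimal illustration, not the code from dsv4_fp4_b300_sglang.sh: the function name `max_running_requests` and the variable `MIN_RUNNING` are hypothetical, and it assumes the script uses shell integer arithmetic, which truncates the division.

```shell
# Sketch of the sizing rule: CONC * 3 / 2 with a floor of 8.
# MIN_RUNNING and max_running_requests are hypothetical names for illustration.
MIN_RUNNING=8

max_running_requests() {
  conc="$1"
  # Integer arithmetic truncates, so CONC=1 yields 1 before the floor applies.
  sized=$(( conc * 3 / 2 ))
  if [ "$sized" -lt "$MIN_RUNNING" ]; then
    sized="$MIN_RUNNING"
  fi
  echo "$sized"
}

# Print the resolved value for a few concurrency levels.
for conc in 1 4 6 64; do
  echo "CONC=$conc -> --max-running-requests $(max_running_requests "$conc")"
done
```

Under these assumptions the floor takes effect for CONC of 5 or less (CONC=5 gives 7 before flooring), while CONC=6 and above keep the plain CONC * 3 / 2 value.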

Test plan

Run a low-CONC sglang benchmark (e.g. CONC=1) on b300 and confirm --max-running-requests is launched as 8.
Run a high-CONC sglang benchmark and confirm --max-running-requests is still CONC * 3 / 2.
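
One way to confirm the launched value is to pull the flag out of the captured server command line. The helper below is a hedged sketch: `extract_max_running_requests` is a hypothetical name, and the sample command line is only an example, since the real launch command in the script may differ.

```shell
# Hypothetical helper: extract the value passed to --max-running-requests
# from a launch command line (for example one copied from a server log or `ps`).
extract_max_running_requests() {
  # Accepts both "--max-running-requests 8" and "--max-running-requests=8".
  printf '%s\n' "$1" | sed -nE 's/.*--max-running-requests[ =]([0-9]+).*/\1/p'
}

# Example command line for illustration only.
sample='python3 -m sglang.launch_server --max-running-requests 8 --tp 8'
extract_max_running_requests "$sample"
```

For the low-CONC case the extracted value should be 8; for the high-CONC case it should equal CONC * 3 / 2.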

🤖 Generated with Claude Code


claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.


* sglang dsv4 mtp
* knob-driven recipe selection
* self-contained mtp config; recipe via dp-attn
* add mtp_1 (1/1/2) variant
* knob-driven recipe selection
* pin sglang image to mega_moe-capable digest
* drop mtp_1 knob; align with PR #1158 image digest
* update nvidia-master.yaml
* fix: restore trailing newline in perf-changelog.yaml
* fix: remove --use-chat-template and floor --max-running-requests at 8
  The tokenizer for DSv4-Pro has no chat_template set, so --use-chat-template causes benchmark_serving.py to crash with ValueError. Remove it to align with dsv4_fp4_b300_sglang.sh. Also add a floor of 8 to --max-running-requests to match the base script and avoid too-low values at low concurrency.
* perf-changelog: add dsv4-fp4-b300-sglang-mtp entry
  Rebase perf-changelog.yaml on latest main (preserving #1173 and #1174 entries) and append the MTP config entry for PR #1166.
* dsv4-b300-sglang-mtp: tune EAGLE spec params from (3,1,4) to (4,1,5)
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Revert "dsv4-b300-sglang-mtp: tune EAGLE spec params from (3,1,4) to (4,1,5)"
  This reverts the EAGLE spec params back to (3, 1, 4): --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Bryan Shan <58582368+Oseltamivir@users.noreply.github.com>
Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com>
Co-authored-by: yhyang201 <yhyang201@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* dsv4-fp4-b300-sglang: revert to #1143 low-latency-only baseline
  Reverts the matrix expansion (#1132), script edits (#1158, #1173, #1174), and changelog retriggers (#1178) on top of the original #1143 entry. Restores the script and config block to their #1143 state and clears all prior dsv4-fp4-b300-sglang changelog entries to start fresh. The dsv4-fp4-b300-sglang-mtp config (#1166) is untouched.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf-changelog: add pr-link for #1184
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf-changelog: keep only the original #1143 entry, drop new entry
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


…C split + DP-attn SWA tweak (#1185)

* dsv4-fp4-b300-sglang: recipe-per-CONC split + DP-attn SWA tweak
  Squashes the cumulative changes from #1158 and #1174 into a single commit on top of the #1184 baseline. Excludes the iterative --max-running-requests floor from #1173.
  - Image pinned to lmsysorg/sglang:deepseek-v4-b300@sha256:26e116bd...
  - Search space: TP8/EP1 conc=1, TP4/EP1 conc=32, TP4/EP4 dp-attn conc=512 for both 1k1k and 8k1k
  - Script dispatches on DP_ATTENTION knob: TP-only (flashinfer_mxfp4) vs DP-attn (deepep + prefill-delayer + mega_moe env vars)
  - DP-attn path enables SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN=1
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf-changelog: add pr-link for #1185
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Apply suggestion from @Qiaolin-Yu
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>

dsv4-fp4 sglang b200/b300: floor --max-running-requests at 8

358e17b

Mirrors the floor-of-4 pattern from the mi355x atom script (SemiAnalysisAI#1170); prevents tiny CONC values from yielding sub-optimal max-running-requests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

sglang-bot requested a review from a team April 26, 2026 10:33

github-project-automation Bot added this to InferenceMAX Board Apr 26, 2026

claude Bot reviewed Apr 26, 2026


sglang-bot and others added 2 commits April 26, 2026 03:35

perf-changelog: add entry for SemiAnalysisAI#1173 (max-running-requests floor)

90faa53

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

revert b200 change; scope to b300 only

e0f0c32

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

sglang-bot changed the title ~~dsv4-fp4 sglang b200/b300: floor --max-running-requests at 8~~ dsv4-fp4 sglang b300: floor --max-running-requests at 8 Apr 26, 2026

Qiaolin-Yu approved these changes Apr 26, 2026


Qiaolin-Yu merged commit 00fe30d into SemiAnalysisAI:main Apr 26, 2026

github-project-automation Bot moved this to Done in InferenceMAX Board Apr 26, 2026

cquil11 mentioned this pull request Apr 26, 2026

dsv4-fp4-b300-sglang: revert to #1143 low-latency-only baseline #1184

Merged

2 tasks

cquil11 mentioned this pull request Apr 26, 2026

[co-authored with sglang community maintainers leads at radixark] [NVIDIA][SGLang][redo PR] B300 DeepSeek v4 FP4 SGLang: recipe-per-CONC split + DP-attn SWA tweak #1185

Merged

3 tasks


Qiaolin-Yu merged 3 commits into SemiAnalysisAI:main from sglang-bot:dsv4-fp4-sglang-floor-max-running-requests
