dsv4-fp4-mi355x-atom: size --max-num-seqs to CONC with floor of 4#1170

Merged
Oseltamivir merged 2 commits into main from dsv4-mi355x-atom-max-num-seqs
Apr 26, 2026

Conversation

@Oseltamivir
Collaborator

Summary

Follow-up to #1165 (dsv4-fp4-mi355x-atom Day-0). Fixes a runtime crash and enables higher-concurrency sweep entries to actually saturate the scheduler.

The first run after #1165 hit:

RuntimeError: GDN mamba tensor (0.00GB for 0 slots) exceeds available KV budget (-18.74GB).
Reduce --max-num-seqs or increase gpu_memory_utilization.

This was because removing the explicit --max-num-seqs 4 in #1165 fell back to ATOM's default of 512, which makes the KV / GDN-mamba allocator overshoot the per-GPU memory budget at our --max-model-len 2304.

Change

Set --max-num-seqs per-config to max(CONC, 4):

MAX_NUM_SEQS=$(( CONC < 4 ? 4 : CONC ))
CONC   MAX_NUM_SEQS
   1              4
   4              4
  16             16
  32             32
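The sizing rule can be sketched in Python (a mirror of the shell arithmetic above for illustration; the helper name is mine, not from the script):

```python
def max_num_seqs(conc: int) -> int:
    """Mirror of MAX_NUM_SEQS=$(( CONC < 4 ? 4 : CONC )):
    floor at 4, otherwise match the client concurrency."""
    return max(conc, 4)

# Reproduce the per-config table from the PR description.
for conc in (1, 4, 16, 32):
    print(f"CONC={conc:>2} -> --max-num-seqs {max_num_seqs(conc)}")
```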

Why these specific bounds

  • Floor of 4: --max-num-seqs 1 was previously observed to hang warmup at 0% GPU. 4 is the minimum value we've seen reach a running server, and also the value AMD used in ROCm/ATOM#650's offline repro.
  • Match CONC for higher entries: lets conc=16 and conc=32 sweep entries actually saturate the scheduler instead of being throttled at a fixed ceiling.
  • No perf loss for conc=1: the client only ever has 1 in flight there, so max-num-seqs of 4 vs 1 is identical wall-clock.

kv_cache[:1,...] risk recap (unchanged from #1165)

ROCm/ATOM#650 PR1 still has self.kv_cache[:1, ...] hardcoded in deepseek_v4.py. Any forward with batch>1 silently corrupts non-slot-0 lanes. Higher max-num-seqs increases the chance the scheduler assembles a batch>1 forward when conc>1 from the client. eval (gsm8k) at conc>1 is the canary — accuracy collapse vs conc=1 indicates the corruption is biting. Throughput numbers alone won't catch it.
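The failure mode of a hardcoded [:1, ...] slice at batch>1 can be sketched with plain Python lists (hypothetical shapes and variable names; ATOM's real tensors and update path differ):

```python
# Hypothetical 4-slot KV cache; entries start stale (zeros).
NUM_SLOTS = 4
kv_cache = [[0.0, 0.0] for _ in range(NUM_SLOTS)]
# Fresh KV produced by a forward over a batch of 4 sequences.
new_kv = [[1.0, 1.0] for _ in range(NUM_SLOTS)]

# The hardcoded slice, sketched: only slot 0 is ever written.
kv_cache[:1] = new_kv[:1]

print(kv_cache[0])  # slot 0 got the update
print(kv_cache[1])  # slots 1-3 silently kept stale values
```

With batch=1 this is invisible; only once the scheduler assembles a batch>1 forward do the non-slot-0 lanes read stale state, which is why accuracy at conc>1 (not throughput) is the canary.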

Test plan

  • CI re-runs the dsv4-fp4-mi355x-atom sweep and reaches steady-state without the GDN-mamba budget error
  • gsm8k eval at conc=4/16/32 produces accuracy comparable to conc=1 (canary for kv_cache[:1] silent corruption)

ATOM's default max_num_seqs of 512 made the KV / GDN-mamba allocator
overshoot the per-GPU memory budget ("GDN mamba tensor exceeds
available KV budget"). Setting it to 1 was previously observed to
hang warmup at 0% GPU. 4 is the minimum value we've seen reach a
running server (also the PR's offline repro value), and matching
upward to CONC ensures the higher-concurrency sweep entries can
actually saturate the scheduler instead of throttling at a fixed
ceiling.

Floor at 4 for CONC<4 (only conc=1 in the current sweep): no perf
loss vs setting it to 1 since the client only ever has 1 request
in flight there, and avoids re-hitting the warmup hang.
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Duplicates the #1165 entry to retrigger dsv4-fp4-mi355x-atom against
the new commit (max-num-seqs floor of 4) on top of the merged Day-0
marker. Same configuration narrative; only pr-link differs.
@Oseltamivir merged commit 8c33454 into main Apr 26, 2026
9 of 12 checks passed
@Oseltamivir deleted the dsv4-mi355x-atom-max-num-seqs branch April 26, 2026 09:16
Comment on lines +215 to 217
# non-slot-0 lanes; eval (gsm8k) at conc>1 is the canary.
MAX_NUM_SEQS=$(( CONC < 4 ? 4 : CONC ))
python3 -m atom.entrypoints.openai_server \
Contributor


🟡 Nit: the EP-guard rationale block at lines 22-28 still claims "max-num-seqs=4 caps the running batch below the YAML's max conc (32)", but with this PR's MAX_NUM_SEQS=$(( CONC < 4 ? 4 : CONC )), CONC=32 yields max-num-seqs=32 — equal to (not below) the YAML max. Worth updating that comment in the same change so the two rationale blocks (lines 22-28 vs the new 207-216) tell the same story.

Extended reasoning...

What's stale: Lines 22-28 of benchmarks/single_node/dsv4_fp4_mi355x_atom.sh are unchanged by this PR and still read:

The CONC guard was relaxed to empirically probe whether kv_cache[:1,...] in deepseek_v4.py actually corrupts at batch>1 in the server path: max-num-seqs=4 caps the running batch below the YAML's max conc (32), and per-sequence eval correctness will tell us if the hardcode bites.

Why it's now wrong: the new line 216 introduces MAX_NUM_SEQS=$(( CONC < 4 ? 4 : CONC )). For the YAML's max CONC=32, this evaluates to MAX_NUM_SEQS=32 — the running batch ceiling now equals the YAML max conc, it does not cap below it. The per-CONC table in the PR description confirms this (CONC=32 → MAX_NUM_SEQS=32). So the load-bearing premise of the EP/CONC-guard rationale comment ("max-num-seqs=4 caps the running batch below 32") no longer holds.

Step-by-step proof:

  1. YAML max conc per the lines 22-28 comment is 32.
  2. After this PR, the script computes MAX_NUM_SEQS=$(( 32 < 4 ? 4 : 32 )) when CONC=32.
  3. $(( 32 < 4 ? 4 : 32 )) = 32.
  4. The server is launched with --max-num-seqs 32, so the scheduler can assemble batches up to 32.
  5. The lines 22-28 comment still says max-num-seqs=4 caps the batch below 32. That sentence is now factually false at CONC=32.

Why the new lines 207-215 don't make this self-correcting: the new comment block describes what the new sizing does ("sized to the client concurrency with a floor at 4") but doesn't touch the EP-guard rationale. A future reader who looks at the EP guard at lines 22-30 to understand why only EP_SIZE != 1 exits-fatal (and why the symmetric CONC guard was dropped) will read the now-false claim that "max-num-seqs=4 caps the running batch below the YAML's max conc (32)" as the justification.

Addressing the refutation: The refuter argues this is comment drift that doesn't lead a reader to take an incorrect action, citing the nit threshold. That's why this is filed as nit, not normal — agreed it doesn't affect runtime. But the lines 22-28 block is the only place that documents the EP-guard reasoning in this script, and the false premise materially changes the argument: the original rationale was "we kept the CONC guard's safety net via the max-num-seqs cap"; with this PR there is no such cap at CONC=32. Two out of two co-located comment blocks now disagree with each other in the same file, and the fix is a one-line edit. Worth folding into this PR.

How to fix: update the sentence in lines 22-28 to match the new sizing, e.g. "max-num-seqs is sized to CONC (floored at 4), so the running batch can reach the YAML's max conc — gsm8k accuracy at conc>1 vs conc=1 is the canary for the kv_cache[:1,...] hardcode biting." Or simply delete the parenthetical about the cap and point to the lines 207-215 block.

@claude (bot) mentioned this pull request Apr 26, 2026
Qiaolin-Yu pushed a commit that referenced this pull request Apr 26, 2026
* dsv4-fp4 sglang b200/b300: floor --max-running-requests at 8

Mirrors the floor-of-4 pattern from the mi355x atom script (#1170);
prevents tiny CONC values from yielding sub-optimal max-running-requests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* perf-changelog: add entry for #1173 (max-running-requests floor)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* revert b200 change; scope to b300 only

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
