dsv4-fp4-mi355x-atom: size --max-num-seqs to CONC with floor of 4#1170

Merged
Oseltamivir merged 2 commits into main from dsv4-mi355x-atom-max-num-seqs
Apr 26, 2026

Conversation

@Oseltamivir
Collaborator

Summary

Follow-up to #1165 (dsv4-fp4-mi355x-atom Day-0). Fixes a runtime crash and enables higher-concurrency sweep entries to actually saturate the scheduler.

The first run after #1165 hit:

RuntimeError: GDN mamba tensor (0.00GB for 0 slots) exceeds available KV budget (-18.74GB).
Reduce --max-num-seqs or increase gpu_memory_utilization.

This was because removing the explicit --max-num-seqs 4 in #1165 fell back to ATOM's default of 512, which makes the KV / GDN-mamba allocator overshoot the per-GPU memory budget at our --max-model-len 2304.

Change

Set --max-num-seqs per-config to max(CONC, 4):

MAX_NUM_SEQS=$(( CONC < 4 ? 4 : CONC ))
CONC   MAX_NUM_SEQS
   1              4
   4              4
  16             16
  32             32
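The sizing rule can be sketched in Python (a mirror of the shell arithmetic above for illustration; the helper name is mine, not from the script):

```python
def max_num_seqs(conc: int) -> int:
    """Mirror of MAX_NUM_SEQS=$(( CONC < 4 ? 4 : CONC )):
    floor at 4, otherwise match the client concurrency."""
    return max(conc, 4)

# Reproduce the per-config table from the PR description.
for conc in (1, 4, 16, 32):
    print(f"CONC={conc:>2} -> --max-num-seqs {max_num_seqs(conc)}")
```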

Why these specific bounds

  • Floor of 4: --max-num-seqs 1 was previously observed to hang warmup at 0% GPU. 4 is the minimum value we've seen reach a running server, and also the value AMD used in ROCm/ATOM#650's offline repro.
  • Match CONC for higher entries: lets conc=16 and conc=32 sweep entries actually saturate the scheduler instead of being throttled at a fixed ceiling.
  • No perf loss for conc=1: the client only ever has 1 in flight there, so max-num-seqs of 4 vs 1 is identical wall-clock.

kv_cache[:1,...] risk recap (unchanged from #1165)

ROCm/ATOM#650 PR1 still has self.kv_cache[:1, ...] hardcoded in deepseek_v4.py. Any forward with batch>1 silently corrupts non-slot-0 lanes. Higher max-num-seqs increases the chance the scheduler assembles a batch>1 forward when conc>1 from the client. eval (gsm8k) at conc>1 is the canary — accuracy collapse vs conc=1 indicates the corruption is biting. Throughput numbers alone won't catch it.
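The failure mode of a hardcoded [:1, ...] slice at batch>1 can be sketched with plain Python lists (hypothetical shapes and variable names; ATOM's real tensors and update path differ):

```python
# Hypothetical 4-slot KV cache; entries start stale (zeros).
NUM_SLOTS = 4
kv_cache = [[0.0, 0.0] for _ in range(NUM_SLOTS)]
# Fresh KV produced by a forward over a batch of 4 sequences.
new_kv = [[1.0, 1.0] for _ in range(NUM_SLOTS)]

# The hardcoded slice, sketched: only slot 0 is ever written.
kv_cache[:1] = new_kv[:1]

print(kv_cache[0])  # slot 0 got the update
print(kv_cache[1])  # slots 1-3 silently kept stale values
```

With batch=1 this is invisible; only once the scheduler assembles a batch>1 forward do the non-slot-0 lanes read stale state, which is why accuracy at conc>1 (not throughput) is the canary.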

Test plan

  • CI re-runs the dsv4-fp4-mi355x-atom sweep and reaches steady-state without the GDN-mamba budget error
  • gsm8k eval at conc=4/16/32 produces accuracy comparable to conc=1 (canary for kv_cache[:1] silent corruption)

ATOM's default max_num_seqs of 512 made the KV / GDN-mamba allocator
overshoot the per-GPU memory budget ("GDN mamba tensor exceeds
available KV budget"). Setting it to 1 was previously observed to
hang warmup at 0% GPU. 4 is the minimum value we've seen reach a
running server (also the PR's offline repro value), and matching
upward to CONC ensures the higher-concurrency sweep entries can
actually saturate the scheduler instead of throttling at a fixed
ceiling.

Floor at 4 for CONC<4 (only conc=1 in the current sweep): no perf
loss vs setting it to 1 since the client only ever has 1 request
in flight there, and avoids re-hitting the warmup hang.
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Duplicates the #1165 entry to retrigger dsv4-fp4-mi355x-atom against
the new commit (max-num-seqs floor of 4) on top of the merged Day-0
marker. Same configuration narrative; only pr-link differs.
@Oseltamivir merged commit 8c33454 into main Apr 26, 2026
9 of 12 checks passed
@Oseltamivir deleted the dsv4-mi355x-atom-max-num-seqs branch April 26, 2026 09:16
Comment on lines +215 to 217
# non-slot-0 lanes; eval (gsm8k) at conc>1 is the canary.
MAX_NUM_SEQS=$(( CONC < 4 ? 4 : CONC ))
python3 -m atom.entrypoints.openai_server \
Contributor


🟡 Nit: the EP-guard rationale block at lines 22-28 still claims "max-num-seqs=4 caps the running batch below the YAML's max conc (32)", but with this PR's MAX_NUM_SEQS=$(( CONC < 4 ? 4 : CONC )), CONC=32 yields max-num-seqs=32 — equal to (not below) the YAML max. Worth updating that comment in the same change so the two rationale blocks (lines 22-28 vs the new 207-216) tell the same story.

Extended reasoning...

What's stale: Lines 22-28 of benchmarks/single_node/dsv4_fp4_mi355x_atom.sh are unchanged by this PR and still read:

The CONC guard was relaxed to empirically probe whether kv_cache[:1,...] in deepseek_v4.py actually corrupts at batch>1 in the server path: max-num-seqs=4 caps the running batch below the YAML's max conc (32), and per-sequence eval correctness will tell us if the hardcode bites.

Why it's now wrong: the new line 216 introduces MAX_NUM_SEQS=$(( CONC < 4 ? 4 : CONC )). For the YAML's max CONC=32, this evaluates to MAX_NUM_SEQS=32 — the running batch ceiling now equals the YAML max conc, it does not cap below it. The per-CONC table in the PR description confirms this (CONC=32 → MAX_NUM_SEQS=32). So the load-bearing premise of the EP/CONC-guard rationale comment ("max-num-seqs=4 caps the running batch below 32") no longer holds.

Step-by-step proof:

  1. YAML max conc per the lines 22-28 comment is 32.
  2. After this PR, the script computes MAX_NUM_SEQS=$(( 32 < 4 ? 4 : 32 )) when CONC=32.
  3. $(( 32 < 4 ? 4 : 32 )) = 32.
  4. The server is launched with --max-num-seqs 32, so the scheduler can assemble batches up to 32.
  5. The lines 22-28 comment still says max-num-seqs=4 caps the batch below 32. That sentence is now factually false at CONC=32.

Why the new lines 207-215 don't make this self-correcting: the new comment block describes what the new sizing does ("sized to the client concurrency with a floor at 4") but doesn't touch the EP-guard rationale. A future reader who looks at the EP guard at lines 22-30 to understand why only EP_SIZE != 1 exits-fatal (and why the symmetric CONC guard was dropped) will read the now-false claim that "max-num-seqs=4 caps the running batch below the YAML's max conc (32)" as the justification.

Addressing the refutation: The refuter argues this is comment drift that doesn't lead a reader to take an incorrect action, citing the nit threshold. That's why this is filed as nit, not normal — agreed it doesn't affect runtime. But the lines 22-28 block is the only place that documents the EP-guard reasoning in this script, and the false premise materially changes the argument: the original rationale was "we kept the CONC guard's safety net via the max-num-seqs cap"; with this PR there is no such cap at CONC=32. Two out of two co-located comment blocks now disagree with each other in the same file, and the fix is a one-line edit. Worth folding into this PR.

How to fix: update the sentence in lines 22-28 to match the new sizing, e.g. "max-num-seqs is sized to CONC (floored at 4), so the running batch can reach the YAML's max conc — gsm8k accuracy at conc>1 vs conc=1 is the canary for the kv_cache[:1,...] hardcode biting." Or simply delete the parenthetical about the cap and point to the lines 207-215 block.

@claude (bot) mentioned this pull request Apr 26, 2026
Qiaolin-Yu pushed a commit that referenced this pull request Apr 26, 2026
* dsv4-fp4 sglang b200/b300: floor --max-running-requests at 8

Mirrors the floor-of-4 pattern from the mi355x atom script (#1170);
prevents tiny CONC values from yielding sub-optimal max-running-requests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* perf-changelog: add entry for #1173 (max-running-requests floor)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* revert b200 change; scope to b300 only

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
