17 changes: 10 additions & 7 deletions benchmarks/single_node/dsv4_fp4_mi355x_atom.sh
@@ -205,20 +205,23 @@ set -x

 BLOCK_SIZE=${BLOCK_SIZE:-16}
 # --enforce-eager is required: ROCm/ATOM#650 (PR1 skeleton) has no CUDAGraph
-# support yet (deferred to a follow-up PR). max-num-seqs uses the ATOM
-# default (512) — matches every other ATOM benchmark script in the repo.
-# The PR1 kv_cache[:1,...] hardcode in deepseek_v4.py means any forward
-# with batch>1 silently corrupts non-slot-0 lanes; this risk activates
-# whenever the scheduler assembles batch>1, regardless of the explicit
-# max-num-seqs value, so pinning it to 4 (the PR's offline repro value)
-# offered no protective benefit. eval (gsm8k) at conc>1 is the canary.
+# support yet (deferred to a follow-up PR). max-num-seqs is sized to the
+# client concurrency with a floor at 4 — the ATOM default (512) makes the
+# KV/GDN-mamba allocator overshoot the GPU budget ("GDN mamba tensor
+# exceeds available KV budget"), and using 1 hangs warmup at 0% GPU. 4
+# is the minimum we've seen complete warmup successfully (also the PR's
+# offline repro value). The PR1 kv_cache[:1,...] hardcode in
+# deepseek_v4.py means any forward with batch>1 silently corrupts
+# non-slot-0 lanes; eval (gsm8k) at conc>1 is the canary.
+MAX_NUM_SEQS=$(( CONC < 4 ? 4 : CONC ))
 python3 -m atom.entrypoints.openai_server \
Comment on lines +215 to 217
Contributor

🟡 Nit: the EP-guard rationale block at lines 22-28 still claims "max-num-seqs=4 caps the running batch below the YAML's max conc (32)", but with this PR's MAX_NUM_SEQS=$(( CONC < 4 ? 4 : CONC )), CONC=32 yields max-num-seqs=32 — equal to (not below) the YAML max. Worth updating that comment in the same change so the two rationale blocks (lines 22-28 vs the new 207-216) tell the same story.

Extended reasoning...

What's stale: Lines 22-28 of benchmarks/single_node/dsv4_fp4_mi355x_atom.sh are unchanged by this PR and still read:

The CONC guard was relaxed to empirically probe whether kv_cache[:1,...] in deepseek_v4.py actually corrupts at batch>1 in the server path: max-num-seqs=4 caps the running batch below the YAML's max conc (32), and per-sequence eval correctness will tell us if the hardcode bites.

Why it's now wrong: the new line 216 introduces MAX_NUM_SEQS=$(( CONC < 4 ? 4 : CONC )). For the YAML's max CONC=32, this evaluates to MAX_NUM_SEQS=32 — the running batch ceiling now equals the YAML max conc; it no longer caps below it. The per-CONC table in the PR description confirms this (CONC=32 → MAX_NUM_SEQS=32). So the load-bearing premise of the EP/CONC-guard rationale comment ("max-num-seqs=4 caps the running batch below 32") no longer holds.

Step-by-step proof:

  1. YAML max conc per the lines 22-28 comment is 32.
  2. After this PR, the script computes MAX_NUM_SEQS=$(( 32 < 4 ? 4 : 32 )) when CONC=32.
  3. $(( 32 < 4 ? 4 : 32 )) = 32.
  4. The server is launched with --max-num-seqs 32, so the scheduler can assemble batches up to 32.
  5. The lines 22-28 comment still says max-num-seqs=4 caps the batch below 32. That sentence is now factually false at CONC=32.
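Step 3 can be sanity-checked directly in bash. This is an illustrative loop, not part of the PR; the CONC values are chosen to cover the floor and the YAML max:

```sh
# Evaluate the new sizing at the boundary values; bash arithmetic
# supports the ternary operator the script uses.
for CONC in 1 2 4 8 32; do
  echo "CONC=$CONC -> MAX_NUM_SEQS=$(( CONC < 4 ? 4 : CONC ))"
done
# Prints 4, 4, 4, 8, 32: the floor only binds below 4; at CONC=32 the
# cap equals the YAML max conc, which is the point of this nit.
```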

Why the new lines 207-215 don't make this self-correcting: the new comment block describes what the new sizing does ("sized to the client concurrency with a floor at 4") but doesn't touch the EP-guard rationale. A future reader who looks at the EP guard at lines 22-30 to understand why only EP_SIZE != 1 exits fatally (and why the symmetric CONC guard was dropped) will read the now-false claim that "max-num-seqs=4 caps the running batch below the YAML's max conc (32)" as the justification.

Addressing the refutation: The refuter argues this is comment drift that doesn't lead a reader to take an incorrect action, citing the nit threshold. That's why this is filed as nit, not normal — agreed it doesn't affect runtime. But the lines 22-28 block is the only place that documents the EP-guard reasoning in this script, and the false premise materially changes the argument: the original rationale was "we kept the CONC guard's safety net via the max-num-seqs cap"; with this PR there is no such cap at CONC=32. Two out of two co-located comment blocks now disagree with each other in the same file, and the fix is a one-line edit. Worth folding into this PR.

How to fix: update the sentence in lines 22-28 to match the new sizing, e.g. "max-num-seqs is sized to CONC (floored at 4), so the running batch can reach the YAML's max conc — gsm8k accuracy at conc>1 vs conc=1 is the canary for the kv_cache[:1,...] hardcode biting." Or simply delete the parenthetical about the cap and point to the lines 207-215 block.
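Rendered as a diff against the quoted sentence (the line wrapping inside the script comment is assumed; the before-text is the quote above and the after-text is the suggested wording):

```diff
-# ... max-num-seqs=4 caps the running batch below the YAML's max conc (32),
-# and per-sequence eval correctness will tell us if the hardcode bites.
+# ... max-num-seqs is sized to CONC (floored at 4), so the running batch can
+# reach the YAML's max conc — gsm8k accuracy at conc>1 vs conc=1 is the
+# canary for the kv_cache[:1,...] hardcode biting.
```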

 --model $MODEL \
 --server-port $PORT \
 -tp $TP \
 --kv_cache_dtype fp8 $CALCULATED_MAX_MODEL_LEN $EP \
 --block-size $BLOCK_SIZE \
 --enforce-eager \
+--max-num-seqs $MAX_NUM_SEQS \
 --trust-remote-code > $SERVER_LOG 2>&1 &
 
 SERVER_PID=$!
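Not part of the diff: once the server backgrounds, a readiness probe plus a one-shot request makes a reasonable smoke test. A sketch assuming the usual OpenAI-compatible routes (/v1/models, /v1/completions), which the entrypoint name suggests but the diff does not show:

```sh
# Wait for the OpenAI-compatible server to come up, then fire one request.
# Route names are assumptions; $PORT and $MODEL come from the script above.
until curl -sf "http://localhost:${PORT}/v1/models" > /dev/null; do sleep 5; done
curl -s "http://localhost:${PORT}/v1/completions" \
  -H 'Content-Type: application/json' \
  -d "{\"model\": \"${MODEL}\", \"prompt\": \"2+2=\", \"max_tokens\": 8}"
```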
11 changes: 11 additions & 0 deletions perf-changelog.yaml
@@ -1844,3 +1844,14 @@
- "Pinned to PR1 limitations: single-sequence kv_cache hardcode, --enforce-eager required, ATOM_USE_TRITON_MOE=1 (aiter fused_moe broken on gfx950)"
- "Sweep will expand to TP=4/8 conc 4–256 once ROCm/ATOM PR3 (multi-request) and PR4 (CUDAGraph) land"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1165

- config-keys:
- dsv4-fp4-mi355x-atom
description:
- "Add DeepSeek-V4-Pro FP4 MI355X ATOM Day-0 marker (single-sequence, TP=8, conc=1)"
- "Image: rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post (matches qwen3.5-fp8-mi355x-atom base); ROCm/ATOM#650 overlaid at runtime via pip install --no-deps -e . from a pinned PR SHA (cdbff35) inside the benchmark script"
- "triton_kernels is missing from the release image (build-stage path /triton-test/python/triton_kernels/ is cleaned up); the script falls back to ROCm/triton@e491726 (RI3.5.x), which has matmul_ogs.py and routing.py (PR #650 imports both — upstream triton-lang/triton refactored matmul_ogs into matmul.py and removed routing) plus CDNA4MXScaleLayout and a target_info.py compatible with the image's bundled triton"
- "Model: deepseek-ai/DeepSeek-V4-Pro (same canonical checkpoint used by dsv4-fp4-b300-vllm); compatibility with PR #650's WeightsMapper not yet verified — first run will tell us"
- "Pinned to PR1 limitations: single-sequence kv_cache hardcode, --enforce-eager required, ATOM_USE_TRITON_MOE=1 (aiter fused_moe broken on gfx950)"
- "Sweep will expand to TP=4/8 conc 4–256 once ROCm/ATOM PR3 (multi-request) and PR4 (CUDAGraph) land"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1170
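For reference, a minimal sketch of the runtime overlay this entry describes. The pinned SHAs, the pip install --no-deps -e . flags, and the python/triton_kernels source layout come from the entry itself; the clone locations, repo URLs, and the pull/650/head fetch are assumptions:

```sh
# Overlay ROCm/ATOM PR #650 at the pinned SHA onto the release image.
git clone https://github.com/ROCm/atom.git /tmp/atom
git -C /tmp/atom fetch origin pull/650/head   # PR head ref, GitHub convention
git -C /tmp/atom checkout cdbff35             # pinned PR SHA from the entry
pip install --no-deps -e /tmp/atom

# triton_kernels fallback: the release image cleans up its build-stage copy,
# so install from ROCm/triton at the RI3.5.x commit, which still carries the
# matmul_ogs.py and routing.py that PR #650 imports.
git clone https://github.com/ROCm/triton.git /tmp/triton
git -C /tmp/triton checkout e491726
pip install --no-deps -e /tmp/triton/python/triton_kernels
```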