
improve NVIDIA CI stability #75

Merged
functionstackx merged 3 commits into main from nvidia-ci-stability
Oct 1, 2025

Conversation

@kedarpotdar-nv
Collaborator

@kedarpotdar-nv kedarpotdar-nv commented Sep 30, 2025

Changes

  • Disable NCCL Graph Registration

NCCL_GRAPH_REGISTER tries to automatically enable user buffer registration with CUDA Graphs. Disabling it can reduce our vLLM and SGLang perf but will improve CI stability.

  • Remove enroot --container-mount-home

This flag conflicts with packages pre-installed in the user's home directory and caused last night's runs to fail.

See successful run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/18146263616
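
For illustration only (this PR does not show where the variable is set, so the placement and the usual NCCL 0/1 convention are assumptions), the override could look like this in a workflow-level env block; the real change may live in the launch scripts instead:

  # hypothetical placement, shown only to make the change concrete
  env:
    NCCL_GRAPH_REGISTER: "0"   # turn off automatic user-buffer registration with CUDA Graphs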

@kedarpotdar-nv
Collaborator Author

Also modified the docker run invocation to include --init, which improves stability by properly reaping zombie processes.

https://www.paolomainardi.com/posts/docker-run-init/

@functionstackx functionstackx merged commit 0027aad into main Oct 1, 2025
@functionstackx functionstackx deleted the nvidia-ci-stability branch October 1, 2025 00:06
@cquil11 cquil11 added the NVIDIA label Apr 8, 2026
Oseltamivir added a commit that referenced this pull request Apr 30, 2026
* Day 0 DeepSeek V4 Pro FP4 GB200 disaggregated SGLang benchmarks

* Drop unsupported backend.connector field from sglang recipes

srtctl SrtConfig schema rejects backend.connector for the sglang
backend type. The field was carried over from the dynamo-vllm dsv4
recipes (where it is valid and set to null). The upstream PR #69/#75
sglang recipes do not declare it.

* Drop dynamo: version: 0.8.1 — incompatible with deepseek-v4-grace-blackwell sglang fork

Re-installing dynamo 0.8.1 over the lmsysorg/sglang:deepseek-v4-grace-blackwell
container's pre-baked sglang fails at import time:

    File ".../dynamo/sglang/health_check.py", line 20
      def _get_bos_token_id_from_engine(engine: Optional[sgl.Engine])
    AttributeError: module 'sglang' has no attribute 'Engine'

The DSV4 sglang fork bundled in this image does not expose sgl.Engine.
Drop the dynamo: block so srtctl uses the dynamo build pre-installed in
the container — matches NVIDIA/srt-slurm PR #75 (the only upstream
DSV4 sglang disagg recipe), which also has no dynamo: block.

* Add dynamo: install: false — srtctl default is install=True

srtctl's DynamoConfig (src/srtctl/core/schema.py L680) defaults to
install=True, which pip installs dynamo 0.8.0 even when no `dynamo:`
block is specified. Use the explicit opt-out so srtctl uses the dynamo
build baked into the lmsysorg/sglang:deepseek-v4-grace-blackwell
image. This image's sglang fork doesn't expose sgl.Engine, which
dynamo.sglang.health_check imports at top level — re-installing
dynamo over it breaks startup.
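
A minimal sketch of the recipe delta, assuming the dynamo block sits at the top level of the recipe YAML as the schema reference above suggests:

  dynamo:
    install: false   # opt out of srtctl's default pip install; use the image's dynamo build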

* Pin dynamo to v1.2.0-sglang-deepseek-v4-dev.1 tag (hash 21f135f5)

install: false fixed the pip-install crash, but the
lmsysorg/sglang:deepseek-v4-grace-blackwell image doesn't have dynamo
pre-installed (ModuleNotFoundError: No module named 'dynamo'), so
srtctl needs to install something compatible.

The DSV4-targeted dynamo tag v1.2.0-sglang-deepseek-v4-dev.1 (sha
21f135f5edf40e12e6ff5db2b462d862a6d6ab9b) includes
'from __future__ import annotations' in dynamo/sglang/health_check.py
(ai-dynamo PR #7255, commit cdb7218a, 2026-03-12), which makes the
Optional[sgl.Engine] annotation lazy. The PyPI 0.8.0/0.8.1 releases
predate that fix and crash with AttributeError on this image's
sglang fork.
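
Sketch of the resulting pin; whether the recipes express it via the hash: field used elsewhere in this PR or via a tag field is not shown here, so take the key name as an assumption:

  dynamo:
    # v1.2.0-sglang-deepseek-v4-dev.1
    hash: 21f135f5edf40e12e6ff5db2b462d862a6d6ab9b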

* Force deepep-mode: low_latency to work around mxfp4+DeepEP normal-dispatch bug

Prefill warmup crashed in run 24941291328 with:

  File ".../sglang/srt/layers/quantization/mxfp4_deepseek.py", line 347
    topk_output = dispatch_output.topk_output
  AttributeError: 'DeepEPNormalDispatchOutput' object has no attribute 'topk_output'

Per sglang server_args.py, --deepep-mode defaults to 'auto', which
picks 'normal' for prefill batches and 'low_latency' for decode. The
mxfp4_deepseek MoE kernel only handles the low_latency dispatch
output shape (which carries topk_output); the normal-dispatch output
type does not, so any prefill forward (or decode warmup using
forward_idle) hits the AttributeError before the worker can serve.

Force deepep-mode: low_latency on every prefill + decode block that
uses moe-a2a-backend: deepep. The two 1p1d-dep8-tep8 decode blocks
remain TP-only (no DeepEP) and are unaffected.

Run reference: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24941291328
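
Sketch of the affected recipe fields (flag names as used throughout these recipes; exact nesting abridged):

  sglang_config:
    moe-a2a-backend: "deepep"
    deepep-mode: low_latency   # was the implicit 'auto', which picks 'normal' for prefill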

* Drop DeepEP / DP-attn / EP — fork-only mxfp4_deepseek bug, both dispatch types broken

Run after the deepep-mode: low_latency change failed again. Logs show
two distinct DeepEP-path failures:

1. Prefill scheduler crash:
     File '.../sglang/srt/layers/quantization/mxfp4_deepseek.py', line 347
       topk_output = dispatch_output.topk_output
     AttributeError: 'DeepEPLLDispatchOutput' object has no attribute 'topk_output'
   The earlier crash had 'DeepEPNormalDispatchOutput' — neither dispatch
   output type in this image's sglang fork exposes topk_output, so
   forcing low_latency vs normal mode does not help. mxfp4_deepseek.py
   is a fork-only file (does not exist in upstream sgl-project/sglang),
   so the API mismatch can only be fixed by rebuilding the image.

2. Decode CUDA graph capture crash:
     RuntimeError: Failed: Assertion error /sgl-workspace/DeepEP/csrc/deep_ep.cpp:1233
       'x.size(0) == topk_idx.size(0) and x.size(0) <= num_max_dispatch_tokens_per_rank'
   DeepEP low_latency_dispatch's per-rank token cap is exceeded by the
   cuda-graph-max-bs we configured.

Both failures are in the DeepEP path. Per upstream sgl-project/sglang
(server_args.py), moe_a2a_backend defaults to 'none', which uses
all-reduce/all-gather dispatch and lets TP shard the expert weights
across ranks (no separate EP needed). NVIDIA/srt-slurm PR #75 (the
only upstream DSV4 sglang disagg recipe) takes the same TP-only stance
— pure tensor-parallel-size: N with no enable-dp-attention, no
moe-a2a-backend deepep, no dp-size, no ep-size.

Drop those five fields from all 6 recipes. Topology shape preserved:
- 1k1k 1p1d: P TP=8 / D TP=8 (4 nodes)
- 1k1k 1p1d-wide: P TP=8 / D TP=16 (6 nodes)
- 1k1k 3p1d-wide: P 3*TP=8 / D TP=16 (10 nodes)
- 8k1k 1p1d: P TP=8 / D TP=8 (4 nodes)
- 8k1k 3p1d-wide: P 3*TP=8 / D TP=16 (10 nodes)
- 8k1k 7p1d-wide: P 7*TP=8 / D TP=16 (18 nodes)

DSV4-Pro at MXFP4 (~340 GB) shards comfortably under TP=8 (~42 GB/rank)
or TP=16 (~21 GB/rank) with mem-fraction-static: 0.82 leaving plenty of
KV cache headroom on each 96 GB GB200 GPU.

Topology filenames retain the 'dep8' / 'dep16' historical names from
the vLLM PR #1129 sibling for symmetry — the actual sglang_config is
TP-only.
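
For reference, a sketch of the per-block change, with the dropped fields shown as comments (values illustrative, nesting abridged):

  sglang_config:
    tensor-parallel-size: 8        # kept; 16 on the -wide decode workers
    # removed:
    #   enable-dp-attention
    #   moe-a2a-backend: "deepep"
    #   deepep-mode
    #   dp-size
    #   ep-size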

* Add moe-dense-tp-size: 1 — fix shared-experts FP8 block-quant divisibility at TP=8/16

After the DeepEP removal, model load crashed at:

  File '.../sglang/srt/layers/quantization/fp8.py', line 282, in validate_block_quant_shapes
    raise ValueError(
  ValueError: Weight output_partition_size = 192 is not divisible
              by weight quantization block_n = 128.

DSV4-Pro's shared-experts gate_up_proj (intermediate dim ~1536) is
FP8-quantized in 128-element blocks. With TP=8 the per-rank slice is 1536/8=192,
which fails the divisibility check. PR #75 sidesteps this by using
TP=4 (1536/4=384), but that locks us into single-node workers.

sglang's --moe-dense-tp-size flag is the documented workaround
(server_args.py: 'useful when, with large TP size, there are errors
caused by weights in MLP layers having dimension smaller than the
min dimension GEMM supports'). Setting moe-dense-tp-size: 1 runs the
shared / dense-MLP layers replicated across ranks (TP=1) while the
rest of the model — attention, routed experts — keeps TP=8/16. Memory
cost is small since shared experts are a fraction of total weights.

Applied to all 6 recipes; topology/node counts unchanged.
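
Sketch of the intended per-block delta (nesting abridged):

  sglang_config:
    tensor-parallel-size: 8    # 16 on the -wide decode workers
    moe-dense-tp-size: 1       # shared-experts / dense MLP replicated at TP=1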

* Set SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024 in all env blocks

Belt-and-suspenders for the DeepEP per-rank dispatch buffer cap. The
default is too low; with this set we'll have headroom if EP / DeepEP
is re-enabled later (e.g., once the fork's mxfp4_deepseek dispatch API
mismatch is fixed). 1024 matches the cookbook's B200 decode reference.
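
Sketch of the env-block entry, assuming the recipes carry a plain key/value env mapping per worker block:

  env:
    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024"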

* Switch to TP=4 single-node — match PR #75 verbatim, fix FP8 block-quant

The run after adding moe-dense-tp-size: 1 still hit:
  ValueError: Weight output_partition_size = 192 is not divisible
              by weight quantization block_n = 128.

Verified in upstream sglang dp_attention.py (compute_dp_attention_local_info):
  if not enable_dp_attention:
      return tp_rank, tp_size, 0   # moe_dense_tp_size IGNORED
The flag is only honored when enable_dp_attention=True. Since we
already dropped DP-attention to avoid the fork's mxfp4_deepseek bug,
moe-dense-tp-size: 1 was a no-op.

Two valid paths:
  (a) re-enable DP-attention without DeepEP — speculative, never tested
  (b) drop to TP=4 — 1536/4=384 divides cleanly by 128, FP8 quant
      passes. Matches NVIDIA/srt-slurm PR #75 (the only verified-
      working DSV4 sglang disagg recipe upstream) verbatim.

Going with (b). Recipes drop moe-dense-tp-size (no longer needed at
TP=4) and switch tensor-parallel-size to 4 in both prefill+decode.
gpus_per_prefill / gpus_per_decode drop to 4 (single GB200 node per
worker). prefill_nodes / decode_nodes track worker counts.

Topology shape (filenames keep historical dep8/dep16 naming for
symmetry with the vLLM #1129 sibling; actual config is TP=4):
  - 1k1k 1p1d-tep8:    P TP=4 / D TP=4 (2 nodes total)
  - 1k1k 1p1d-dep16:   P TP=4 / D TP=4 (2 nodes total) — same shape, different conc
  - 1k1k 3p1d-dep16:   P 3*TP=4 / D TP=4 (4 nodes)
  - 8k1k 1p1d-tep8:    P TP=4 / D TP=4 (2 nodes)
  - 8k1k 3p1d-dep16:   P 3*TP=4 / D TP=4 (4 nodes)
  - 8k1k 7p1d-dep16:   P 7*TP=4 / D TP=4 (8 nodes)

nvidia-master.yaml updated to match (tp: 4, ep: 1, dp-attn: false on
every prefill+decode block — including the commented 8k/1k block).

Also bumped SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK 1024 → 2048
in all env blocks (DeepEP path is dormant in this config but the env
var is in place for re-enabling later).
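
Sketch of the per-recipe shape after this change (field names from this message; their exact placement in the recipe is abridged):

  sglang_config:
    tensor-parallel-size: 4    # 1536/4 = 384, divisible by block_n = 128
  gpus_per_prefill: 4          # one GB200 node per worker
  gpus_per_decode: 4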

* Restore mi355x retry changelog entries clobbered by merge

The merge of main into this branch (c0aec93) accidentally overwrote
the two dsv4-fp8-mi355x-sglang retry entries (PR #1148 retry-pair tail
and PR #1159 retry-pair) with duplicated copies of our own
dsv4-fp4-gb200-dynamo-sglang entry. The process_changelog.py gate
rejects deletions, so the workflow blocked.

Restore the two mi355x entries verbatim from origin/main and keep a
single copy of our dsv4 entry, appended after the restored mi355x
block. perf-changelog.yaml diff vs origin/main is now additions-only.

* Switch back to TP=8: enable-dp-attention + moe-dense-tp-size: 1, no moe-a2a-backend

TP=4 OOMed — DSV4-Pro at MXFP4 doesn't fit on a single GB200 node.
Need TP=8 across 2 nodes (768 GB total).

But TP=8 trips two issues that earlier rounds papered over:
  a) shared-experts gate_up_proj FP8 block-quant divisibility
     (1536/8=192, not a multiple of block_n=128)
  b) the lmsysorg/sglang:deepseek-v4-grace-blackwell fork's
     mxfp4_deepseek kernel crashes on every DeepEP forward path

Single combo that solves both — verified in upstream sglang source:
  * enable-dp-attention: true  +  moe-dense-tp-size: 1
    Runs dense / shared-MLP layers replicated (TP=1) — fixes (a).
    moe-dense-tp-size IS gated on enable_dp_attention=True per
    python/sglang/srt/layers/dp_attention.py
    (compute_dp_attention_local_info ignores it when DP-attn is off).
  * NO moe-a2a-backend set (default 'none')
    Lands the model on forward_normal instead of forward_deepep —
    avoids (b). Verified in deepseek_v2.py:
      _enable_a2a_moe = is_deepep | is_mooncake | is_nixl | is_mori
                       | is_ascend_fuseep | is_flashinfer
    With backend='none' this is False and forward_normal runs.

Recipes: tensor-parallel-size 4 → 8 (both prefill+decode); add
moe-dense-tp-size: 1, enable-dp-attention: true, dp-size: 8 to every
sglang_config block; gpus_per_prefill / gpus_per_decode 4 → 8;
prefill_nodes / decode_nodes scale to workers × 2.

nvidia-master.yaml mirrors: tp 4 → 8, dp-attn false → true on every
prefill+decode block (active 1k/1k + commented 8k/1k). Topology shape
restored to:
  - 1k1k 1p1d-* : 4 nodes (was 2)
  - 1k1k 3p1d-* : 8 nodes (was 4)
  - 8k1k 1p1d-* : 4 nodes (commented)
  - 8k1k 3p1d-* : 8 nodes (commented)
  - 8k1k 7p1d-* : 16 nodes (commented)
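
Putting the combo above into recipe form, a sketch of each prefill/decode block (nesting abridged):

  sglang_config:
    tensor-parallel-size: 8
    enable-dp-attention: true
    dp-size: 8
    moe-dense-tp-size: 1
    # no moe-a2a-backend set -> defaults to 'none', so forward_normal runs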

* Scope sweep to high-conc DeepEP only — temporarily comment 1p1d blocks

Comment out the low-conc (1-64) and mid-conc (128-4096) search-space
entries in nvidia-master.yaml so the sweep iterates only on the high-
conc 3p1d-dep8-dep16 topology. Re-enable DeepEP on that one recipe to
exercise the EP path:

  3p1d-dep8-dep16 prefill+decode:
    + ep-size: 8
    + moe-a2a-backend: "deepep"
    + deepep-mode: low_latency
    (kept enable-dp-attention + moe-dense-tp-size: 1 + tp=8 / dp=8)

Master matrix label updated to ep=8 to reflect the recipe.

Sibling 1p1d recipes on disk are unchanged (still TP=8 + DP-attn,
no DeepEP). They are still referenced by the commented-out master
entries — restore them by uncommenting.

* tep fix + dep for high conc

* sike no dpa

* Cap SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK at 1024 — sglang LL hard ceiling

DeepEP run (3p1d-dep8-dep16) crashed at:

  File '.../sglang/srt/layers/moe/token_dispatcher/deepep.py', line 325
    assert self.num_max_dispatch_tokens_per_rank <= 1024
  AssertionError

_DeepEPDispatcherImplLowLatency enforces a hard upper bound of 1024 in
low_latency mode. We had bumped the env var to 2048 to give headroom
above the earlier C++ side cap (deep_ep.cpp:1233 'x.size(0) <=
num_max_dispatch_tokens_per_rank'), but 2048 trips this Python-side
assertion at scheduler init. 1024 is exactly the allowed maximum: high
enough to cover the cuda-graph-max-bs we use, low enough to satisfy
the LL dispatcher constructor.

Apply 2048 → 1024 across all 6 recipes (every prefill + decode env
block).

* Revert 3p1d-dep8-dep16 to no-DeepEP TP-only; uncomment full 1k/1k + 8k/1k sweep

DeepEP is broken on the lmsysorg/sglang:deepseek-v4-grace-blackwell
image — verified across three runs (deepep-mode auto/normal,
deepep-mode low_latency, and the latest 3p1d try). All hit the
fork-only mxfp4_deepseek.py:347 reading dispatch_output.topk_output,
which neither DeepEPLLDispatchOutput nor DeepEPNormalDispatchOutput
exposes in this fork. Cannot be fixed from the recipe — needs the
image rebuilt with mxfp4_deepseek patched, or an upstream sglang fix.

3p1d-dep8-dep16 recipe: drop ep-size, moe-a2a-backend, deepep-mode
from prefill+decode. Now matches the 1p1d siblings: TP=8 + DP=8 +
moe-dense-tp-size: 1, default 'none' a2a backend (forward_normal
path bypasses the buggy mxfp4_deepseek kernel).

nvidia-master.yaml:
  * Uncomment the 1k/1k mid-conc and 8k/1k blocks (low + mid + high).
  * 3p1d-dep8-dep16 matrix label ep: 8 → ep: 1 to match recipe.

Sweep now expands to 6 entries / 27 conc points (3 1k/1k + 3 8k/1k).

* Try moe-a2a-backend: flashinfer on 3p1d-dep8-dep16 for high-conc EP

DeepEP is dead in this image (mxfp4_deepseek.py:347 reads
dispatch_output.topk_output, neither DeepEPNormal nor DeepEPLL output
exposes that field). Smoke test the only other plausible EP backend
upstream sglang offers: flashinfer.

Per upstream docs/advanced_features/expert_parallelism.md, flashinfer
is the documented option for 'Large-scale EP deployments' and uses a
different dispatcher than DeepEP — its output class may or may not
trip the same mxfp4_deepseek bug. Per server_args.py _handle_a2a_moe,
flashinfer auto-sets SGLANG_MOE_NVFP4_DISPATCH=True and forces
ep_size = tp_size, so we set ep-size: 8 explicitly. Everything else
(TP=8 / DP=8 / moe-dense-tp-size: 1) stays so the FP8 block-quant
path remains valid.

Scope: 1k/1k 3p1d-dep8-dep16 only. If the EP path serves on this
image, port back to the 1p1d siblings; if it crashes the same way
DeepEP did, revert to the no-EP forward_normal path and accept the
TP-only pareto.

nvidia-master.yaml matrix labels for the 3p1d entry updated to ep=8
to match the recipe.
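
Sketch of the smoke-test delta on the 3p1d recipe (other fields unchanged):

  sglang_config:
    moe-a2a-backend: "flashinfer"
    ep-size: 8                 # flashinfer forces ep_size = tp_size, so set it explicitly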

* Revert flashinfer EP attempt — accept TP-only pareto, every EP backend dead on this image

flashinfer EP smoke test (3p1d-dep8-dep16 1k/1k) crashed at startup:

  File '.../sglang/srt/server_args.py', line 2133, in _handle_a2a_moe
    assert self.moe_runner_backend in [...]
  AssertionError: Flashinfer MoE A2A is only supported with
                  flashinfer_cutlass moe runner backend

flashinfer_cutlass is FP8-only — won't load DSV4-Pro's MXFP4 weights.
The only path that satisfies the assertion would also fail at model
load. So flashinfer is unusable for DSV4 on any image that doesn't
ship a flashinfer_mxfp4_cutlass runner (which doesn't exist).

Combined with the earlier deepep failure (mxfp4_deepseek.py:347
AttributeError on dispatch_output.topk_output, both Normal and LL
dispatch types), every EP backend sglang exposes in this image is
dead. Remaining options (mooncake, nixl-ep, mori, ascend_fuseep) are
either Ascend-NPU-only or not wired into this image.

Revert 3p1d-dep8-dep16 recipe to no-EP TP-only (matches the 5 sibling
recipes) and master.yaml matrix labels (ep: 8 → ep: 1).

PR description's Known Issues section updated to a 4-row table
covering every EP backend tried and accepted as a dead end.

* fix(sglang): bump 8k1k prefill max-running-requests from 4 to 8

sglang computes per-rank capacity as max_running_requests // dp_size.
With dp-size=8, a value of 4 floors to 0, hitting the
"max_running_request is zero" assertion in tp_worker.py:277.
Bump to 8 so each DP rank gets at least 1 slot — matches the
working 1p1d recipe.
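
Sketch of the fix, with the floor-division that motivates it shown as a comment:

  sglang_config:
    dp-size: 8
    max-running-requests: 8    # per-rank slots = 8 // 8 = 1; the old value 4 // 8 floored to 0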

* ports

* Dsv4 fp4 gb200 dynamo sglang disagg (#1213)

* Modify deepseek-v4 configuration for new model settings

* Update YAML configuration for deepseek model

* adapt for model path, etc

* dev

* upd

* fix

* fix

* test

* add gb300

* upd

* fix

* fix

* fix

* fix(launch_gb300-cw): register deepseek-v4-pro alias in model_paths

After fixing the recipe overlay path in 1b07108, srtctl now loads our
hand-rolled SGLang recipe and runs preflight, which rejects:

    Error: Preflight failed for recipes/sglang/.../disagg-gb300-2p1d-dep4-dep8.yaml:
    - model.path: Model 'deepseek-v4-pro' is not a local model path and
      is not defined in srtslurm.yaml model_paths.

Both `disagg-gb300-2p1d-dep4-dep8.yaml` and `disagg-gb300-7p1d-dep4-dep8.yaml`
declare `model.path: deepseek-v4-pro` (per the recipe header comment, the
alias is intentionally aligned with `launch_gb200-nv.sh`'s srtslurm.yaml,
which exports `SRT_SLURM_MODEL_PREFIX=deepseek-v4-pro`). The gb300-cw
launcher only registered `dspro` and `dsv4-pro`, so the alias never
resolved. Add `deepseek-v4-pro` mapping to the same `${MODEL_PATH}`.
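
A sketch of the srtslurm.yaml mapping after the fix; the ${MODEL_PATH} expansion is handled by the launcher script, so the literal value here is illustrative:

  model_paths:
    dspro: ${MODEL_PATH}
    dsv4-pro: ${MODEL_PATH}
    deepseek-v4-pro: ${MODEL_PATH}   # newly registered alias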

* fix(launch_gb300-cw): pull arm64 squash and force fresh import per runner

After fixing model.path alias (fe6815c), the slurm orchestrator reached
the head infrastructure srun and crashed at:

    [ERROR] Invalid image format: /mnt/vast/squash_dupe/lmsysorg_sglang_deepseek-v4-grace-blackwell.sqsh
    error: pyxis: failed to create container filesystem
    error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1

Three issues:

1. The runner pod that runs `enroot import docker://lmsysorg/sglang:...`
   is x86, so without `--arch` enroot fetches the amd64 manifest. The
   compute nodes (slurm-gb300-138-*) are aarch64 and pyxis there
   rejects the amd64 squash with "Invalid image format". Pass
   `--arch arm64` and tag the cache filename with `_arm64`.

2. `enroot import -o existing.sqsh ...` aborts with
   `[ERROR] File already exists` and leaves the stale file in place,
   so once a half-baked or pre-tag-update squash lands at this path it
   is silently reused on every subsequent CI run. Inspecting
   /mnt/vast/squash_dupe showed an Apr 26 amd64 sqsh shadowing the
   Apr 28 working arm64 sqsh exactly like this. `rm -f` before each
   import forces fresh downloads and picks up Docker tag updates.

3. Scope the squash filename per RUNNER_NAME (gb300-cw_0..3) so that
   the four matrix runners do not race on rm+import of the same shared
   path on /mnt/vast. Cost: ~64 GB on /mnt/vast (4 runners × 16 GB
   per arm64 sqsh) instead of 16 GB shared, which is fine on the
   shared VAST mount.

* fix(launch_gb300-cw): use enroot --arch aarch64, not arm64

enroot 4.0.1's `common::debarch()` accepts kernel-style arch names
(`x86_64`, `aarch64`, `ppc64le`) and emits Docker-style names
(`amd64`, `arm64`, `ppc64le`) on the wire. Passing `--arch arm64` (the
Docker manifest name) trips the function's else branch immediately:

    [ERROR] Unsupported architecture: arm64

Use the kernel name `aarch64` so enroot can map it to docker's `arm64`
manifest internally.

* fix(launch_gb300-cw): use pre-staged arm64 sqsh, drop in-CI enroot import

Even with `--arch aarch64`, `enroot import` from the CI runner pod (x86)
fails when converting the arm64 image:

    [INFO] Converting whiteouts...
    /usr/bin/bash: line 1: /usr/bin/enroot-aufs2ovlfs: Operation not permitted
    (repeated dozens of times, then preflight reports the sqsh as missing)

`enroot-aufs2ovlfs` requires CAP_SYS_ADMIN that the runner pod doesn't
hold, and `lmsysorg/sglang:deepseek-v4-grace-blackwell` is arm64-only,
so the conversion can't be skipped either. Per the documented manual
flow at https://gist.github.com/Fridge003/42c6001e0bb613acf0e411305b8ea780
the import has to be dispatched to an aarch64 GB300 compute node via
`srun`.

Rather than running an extra slurm job per CI invocation just to
prepare the sqsh, point the launcher at the pre-staged arm64 sqsh that
already lives at
`/mnt/vast/squash_dupe/lmsysorg_sglang_deepseek-v4-grace-blackwell_arm64.sqsh`
(refreshed manually via the gist script when the docker tag is bumped).
The matching `nginx_1.27.4_arm64.sqsh` was symlinked alongside.

Add a fast-fail check so a missing pre-staged sqsh produces a clear
error instead of a confusing pyxis "Invalid image format" three steps
later.

* fix(launch_gb300-cw): persist dynamo wheel cache and ulimit preamble

Two follow-up fixes now that CI successfully reaches slurm. The first
addresses the dynamo-from-source step (`dynamo: hash: 9d3c913d…`) being
rebuilt cold on every CI run, taking ~10-20 minutes per matrix job:

1. Cluster-wide dynamo wheel cache. srtctl's
   `_hash_cached_source_install` (`src/srtctl/core/schema.py:912`) is
   already designed to cache hash-pinned builds at
   `/configs/dynamo-wheels/<hash>/{ai_dynamo_runtime-*.whl,dynamo-src.tar.gz,.complete}`
   under flock. The cache only works if `/configs/dynamo-wheels` survives
   between CI runs, but the launcher does `rm -rf srt-slurm` and
   re-clones every time, blowing it away. Mount
   `/mnt/vast/dynamo-wheels-cache` (NFS, shared by every gb300-cw_N
   runner) over `/configs/dynamo-wheels` via srtslurm.yaml
   `default_mounts`, so the cache survives `rm -rf` and is shared
   across all matrix jobs. After the first cold build the warm path
   should drop dynamo install to ~30 s.

2. Cluster-wide bash preamble for ulimits. yangminl's manual setup on
   this cluster (`/mnt/home/yangminl/srt-slurm/srtslurm.yaml`) sets
   `default_bash_preamble: "ulimit -n 1048576 && ulimit -a"` so the
   dynamo frontend / sglang servers can accept the 8192-concurrency
   sweep without `EMFILE: too many open files`. Mirror that here. The
   feature is supported by srtctl's pinned commit
   (`src/srtctl/core/slurm.py:_get_cluster_bash_preamble`).
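
Sketch of the two srtslurm.yaml additions; the key names come from this message, while the mount-entry syntax is an assumption:

  default_mounts:
    - /mnt/vast/dynamo-wheels-cache:/configs/dynamo-wheels
  default_bash_preamble: "ulimit -n 1048576 && ulimit -a"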

* fix(sglang/dsv4/8k1k recipes): set cpus-per-task=144 for dynamo build

slurm assigns 1 CPU/task by default; `scontrol show job <id>` from a
recent CI run shows `NumCPUs=4 NumTasks=4 CPUs/Task=1` with 4 nodes,
i.e. one core per worker. The dynamo `hash:` source install rebuilds
~500 rust crates (kube-client, tonic, hf-hub, image codecs ravif/exr,
pyo3 stack) and at one core takes 30+ min just for the cold build,
which dominates total CI time even with the new
`/configs/dynamo-wheels` cache (the cache only helps after the first
cold run).

Match yangminl's working manual setup
(`/mnt/home/yangminl/srt-slurm/recipes/dsv4-pro/sglang/gb300-fp4/all-dynamo.yaml`)
which sets `sbatch_directives.cpus-per-task: "144"` so cargo gets the
full GB300 host (144 cores) and finishes maturin in a few minutes.

* fix(sglang/dsv4/8k1k recipes): set cpus-per-task=144 and mem=0

slurm assigns 1 CPU/task by default; `scontrol show job 613` from a
running CI job confirmed `NumCPUs=4 NumTasks=4 CPUs/Task=1` with 4
nodes — one core per worker. The dynamo `hash:` cold source install
rebuilds ~500 rust crates (kube-client, tonic, hf-hub, image codecs
ravif/exr, the pyo3 stack) and at one core takes 30+ min just for the
cold build, which dominates total CI time even with the new
`/configs/dynamo-wheels` cache (the cache only helps after the first
cold run).

Match yangminl's working manual setup on the same gb300-cw cluster
(`/mnt/home/yangminl/srt-slurm/recipes/dsv4-pro/sglang/gb300-fp4/all-dynamo.yaml`)
which sets:
  sbatch_directives:
    cpus-per-task: "144"
    mem: "0"

cargo then gets the full 144-core GB300 host and finishes maturin in a
few minutes; mem=0 hands the worker the entire node's RAM so the
dynamo build + DSV4-Pro 671B FP4 weight load fit without OOM.

* fix(launch_gb300-cw): pin srt-slurm fork with parallel sa-bench

The current sa-bench in NVIDIA/srt-slurm@9d75f82 generates random
prompts single-threaded, which dominates 7p1d/conc=8192 bench startup
(~50 min just for the 81920-prompt main pass before the first HTTP
request reaches dynamo). Pin to fzyzcjy/srt-slurm fork branch
`feat/random-num-workers` (commit 8094cfb), which is 9d75f82 + the
SemiAnalysisAI/InferenceX `utils/bench_serving/` benchmark_serving.py
ported into sa-bench. With `--random-num-workers 48` (now the default
in bench.sh) prompt generation drops to ~1 min on a 144-core GB300
host, putting the bench-startup cost on the same order as
infra+model-load instead of dominating it.

The fork is paired with the upstream PR
NVIDIA/srt-slurm#114; once that merges, this
pin should revert to the bumped NVIDIA/srt-slurm SHA.

* fix(launch_gb300-cw): bump srt-slurm fork pin to minimal multiproc patch

Previous pin (8094cfb) was a wholesale replacement of sa-bench with
the SemiAnalysisAI/InferenceX bench_serving — that dropped
`async_request_dynamo_completions` from `ASYNC_REQUEST_FUNCS`, so
`bench.sh` would have died on `--backend dynamo` argparse rejection
the moment the bench client started.

New pin (4249d16) is a tight ~100-line patch on top of
NVIDIA/srt-slurm@9d75f82 that only adds parallel random prompt
generation (`--random-num-workers`); everything else, including the
dynamo backend and `--custom-tokenizer` plumbing, stays exactly the
same as upstream. See NVIDIA/srt-slurm#114.

* ci: temporarily comment out conc-list:[64] 2p1d entry

Focus CI on the conc=8192 7p1d max-throughput entry only — re-enable
the 2p1d/conc=64 mid-curve entry shortly once that's green.

* ci(eval): temporarily skip dsv4-fp4-gb300 dynamo-sglang eval-only entry

The srt-slurm pin (9d75f82, recipes/dsv4-agg-disagg) lacks the lm-eval
orchestrator path that lives on sa-submission-q2-2026. Skip the auto-generated
eval-only matrix entry for this config until the pin is bumped.

TODO: remove this branch once the pin is moved to sa-submission-q2-2026 (which
already carries the EVAL_ONLY do_sweep.py branch and lm-eval/bench.sh).

* bench(7p1d-dep4-dep8): swap sa-bench default for yangminl's gb300-cw recipe

Replace the sa-bench builder (concurrencies=8192, req_rate=inf, sa-bench
default num_prompts/num_warmups multipliers) with the exact custom
command from yangminl's gb300-cw 8k1k_hightpt[0] run (slurm job 564 on
the dsv4-pro-gb300-fp4 cluster):

  concurrency=4096, rate=48, num_prompts=40960, num_warmups=512,
  random_num_workers=96.

Why mirror those exact knobs: that recipe is what produced the 7p1d
reference numbers we benchmarked against (358K total tok/s, 39.9K output
tok/s, ~5s mean TTFT). Running sa-bench at concurrency=8192/rate=inf
will saturate the 1-decode-worker GPU (we observed 16384 concurrency on
job 617 saturated decode at ~390 running/rank with mean TTFT ~257s,
i.e. equilibrium gated by decode compute, not the bench), making the
result not directly comparable.

Bench framework note: the fzyzcjy fork's benchmark_serving.py /
benchmark_utils.py / encoding_dsv4.py are byte-identical to upstream
SemiAnalysisAI/InferenceX/main; only backend_request_func.py adds five
per-request debug print sites (ok=/lat=/url=/plen=/err=). Throughput
numbers should match sa-bench at the same flags; the fork is chosen
here to keep parity with the reference run's logs.

Skipped on purpose:
- DeepGEMM env knobs (SGLANG_DG_CACHE_DIR / SGLANG_JIT_DEEPGEMM_PRECOMPILE
  vs SGLANG_JIT_DEEPGEMM_FAST_WARMUP=1) — yangminl's cache dir is
  /configs/deepgemm_cache on the gb300-cw host and isn't portable here;
  PR's FAST_WARMUP path stays.
- expert_location_dispatch.py topk_ids int32 cast (yangminl commits
  94b7dc4c7 + e933ef2b1 on the patched sglang fork) — not pulling that
  into the container build.

* config(7p1d-dep4-dep8): align with job 564 — multi-frontend, sbatch dirs, name

Eliminate every non-cluster-specific diff vs job 564's resolved config
(`/outputs/564/config_8k1k_hightpt_0.yaml`):

- name: match `dsv4-pro-gb300-fp4_8k1k_hightpt_0` (was stale gb200 string)
- frontend.enable_multiple_frontends: false → true; add num_additional_frontends: 8
  (job 564 ran 9 dynamo frontends behind nginx; PR was running a single
  frontend, which was a real router-side runtime diff)
- slurm.time_limit: 8h → 3h to match job 564
- sbatch_directives.cpus-per-task: 144, mem: 0 (portable, was missing)
- drop health_check block (job 564 doesn't set it; rely on srtctl default)

Remaining diffs vs job 564 are all either cluster-specific path bindings
(slurm.partition=hpc-mid, frontend.nginx_container, extra_mount of
yangminl's patched sglang) or DG-cache env (SGLANG_DG_CACHE_DIR /
SGLANG_JIT_DEEPGEMM_PRECOMPILE) — those need InferenceX-cluster-side
equivalents and are documented in the header comment.
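
Sketch of the aligned fields (nesting abridged; the time-limit format written here is illustrative):

  frontend:
    enable_multiple_frontends: true
    num_additional_frontends: 8    # nine dynamo frontends total behind nginx
  slurm:
    time_limit: 3h                 # was 8h
  sbatch_directives:
    cpus-per-task: "144"
    mem: "0"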

* config(7p1d-dep4-dep8): keep PR name field, revert to original

* upd

* fix

* fix

* middle

* fi

* fix

* upd

* fix

* upd

---------

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Cheng Wan <chwan@rice.edu>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>