
improve NVIDIA CI stability #75

Merged
functionstackx merged 3 commits into main from nvidia-ci-stability
Oct 1, 2025

Conversation

@kedarpotdar-nv
Collaborator

@kedarpotdar-nv kedarpotdar-nv commented Sep 30, 2025

Changes

  • Disable NCCL Graph Registration

NCCL_GRAPH_REGISTER tries to automatically enable user buffer registration with CUDA Graphs. Disabling it can reduce our vLLM and SGLang perf but will improve CI stability.

  • Remove enroot --container-mount-home

This flag conflicts with packages pre-installed in the user's home directory and caused last night's runs to fail.

See successful run: https://github.com/InferenceMAX/InferenceMAX/actions/runs/18146263616
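
For illustration only (this PR does not show where the variable is set, so the placement and the usual NCCL 0/1 convention are assumptions), the override could look like this in a workflow-level env block; the real change may live in the launch scripts instead:

  # hypothetical placement, shown only to make the change concrete
  env:
    NCCL_GRAPH_REGISTER: "0"   # turn off automatic user-buffer registration with CUDA Graphs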

@kedarpotdar-nv
Collaborator Author

Also modified the docker run invocation to include --init, which improves stability by properly reaping zombie processes.

https://www.paolomainardi.com/posts/docker-run-init/

@functionstackx functionstackx merged commit 0027aad into main Oct 1, 2025
@functionstackx functionstackx deleted the nvidia-ci-stability branch October 1, 2025 00:06
@cquil11 cquil11 added the NVIDIA label Apr 8, 2026
Oseltamivir added a commit that referenced this pull request Apr 30, 2026
* Day 0 DeepSeek V4 Pro FP4 GB200 disaggregated SGLang benchmarks

* Drop unsupported backend.connector field from sglang recipes

srtctl SrtConfig schema rejects backend.connector for the sglang
backend type. The field was carried over from the dynamo-vllm dsv4
recipes (where it is valid and set to null). The upstream PR #69/#75
sglang recipes do not declare it.

* Drop dynamo: version: 0.8.1 — incompatible with deepseek-v4-grace-blackwell sglang fork

Re-installing dynamo 0.8.1 over the lmsysorg/sglang:deepseek-v4-grace-blackwell
container's pre-baked sglang fails at import time:

    File ".../dynamo/sglang/health_check.py", line 20
      def _get_bos_token_id_from_engine(engine: Optional[sgl.Engine])
    AttributeError: module 'sglang' has no attribute 'Engine'

The DSV4 sglang fork bundled in this image does not expose sgl.Engine.
Drop the dynamo: block so srtctl uses the dynamo build pre-installed in
the container — matches NVIDIA/srt-slurm PR #75 (the only upstream
DSV4 sglang disagg recipe), which also has no dynamo: block.

* Add dynamo: install: false — srtctl default is install=True

srtctl's DynamoConfig (src/srtctl/core/schema.py L680) defaults to
install=True, which pip installs dynamo 0.8.0 even when no `dynamo:`
block is specified. Use the explicit opt-out so srtctl uses the dynamo
build baked into the lmsysorg/sglang:deepseek-v4-grace-blackwell
image. This image's sglang fork doesn't expose sgl.Engine, which
dynamo.sglang.health_check imports at top level — re-installing
dynamo over it breaks startup.
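
A minimal sketch of the recipe delta, assuming the dynamo block sits at the top level of the recipe YAML as the schema reference above suggests:

  dynamo:
    install: false   # opt out of srtctl's default pip install; use the image's dynamo build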

* Pin dynamo to v1.2.0-sglang-deepseek-v4-dev.1 tag (hash 21f135f5)

install: false fixed the pip-install crash, but the
lmsysorg/sglang:deepseek-v4-grace-blackwell image doesn't have dynamo
pre-installed (ModuleNotFoundError: No module named 'dynamo'), so
srtctl needs to install something compatible.

The DSV4-targeted dynamo tag v1.2.0-sglang-deepseek-v4-dev.1 (sha
21f135f5edf40e12e6ff5db2b462d862a6d6ab9b) includes
'from __future__ import annotations' in dynamo/sglang/health_check.py
(ai-dynamo PR #7255, commit cdb7218a, 2026-03-12), which makes the
Optional[sgl.Engine] annotation lazy. The PyPI 0.8.0/0.8.1 releases
predate that fix and crash with AttributeError on this image's
sglang fork.
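
Sketch of the resulting pin; whether the recipes express it via the hash: field used elsewhere in this PR or via a tag field is not shown here, so take the key name as an assumption:

  dynamo:
    # v1.2.0-sglang-deepseek-v4-dev.1
    hash: 21f135f5edf40e12e6ff5db2b462d862a6d6ab9b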

* Force deepep-mode: low_latency to work around mxfp4+DeepEP normal-dispatch bug

Prefill warmup crashed in run 24941291328 with:

  File ".../sglang/srt/layers/quantization/mxfp4_deepseek.py", line 347
    topk_output = dispatch_output.topk_output
  AttributeError: 'DeepEPNormalDispatchOutput' object has no attribute 'topk_output'

Per sglang server_args.py, --deepep-mode defaults to 'auto', which
picks 'normal' for prefill batches and 'low_latency' for decode. The
mxfp4_deepseek MoE kernel only handles the low_latency dispatch
output shape (which carries topk_output); the normal-dispatch output
type does not, so any prefill forward (or decode warmup using
forward_idle) hits the AttributeError before the worker can serve.

Force deepep-mode: low_latency on every prefill + decode block that
uses moe-a2a-backend: deepep. The two 1p1d-dep8-tep8 decode blocks
remain TP-only (no DeepEP) and are unaffected.

Run reference: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24941291328
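
Sketch of the affected recipe fields (flag names as used throughout these recipes; exact nesting abridged):

  sglang_config:
    moe-a2a-backend: "deepep"
    deepep-mode: low_latency   # was the implicit 'auto', which picks 'normal' for prefill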

* Drop DeepEP / DP-attn / EP — fork-only mxfp4_deepseek bug, both dispatch types broken

Run after the deepep-mode: low_latency change failed again. Logs show
two distinct DeepEP-path failures:

1. Prefill scheduler crash:
     File '.../sglang/srt/layers/quantization/mxfp4_deepseek.py', line 347
       topk_output = dispatch_output.topk_output
     AttributeError: 'DeepEPLLDispatchOutput' object has no attribute 'topk_output'
   The earlier crash had 'DeepEPNormalDispatchOutput' — neither dispatch
   output type in this image's sglang fork exposes topk_output, so
   forcing low_latency vs normal mode does not help. mxfp4_deepseek.py
   is a fork-only file (does not exist in upstream sgl-project/sglang),
   so the API mismatch can only be fixed by rebuilding the image.

2. Decode CUDA graph capture crash:
     RuntimeError: Failed: Assertion error /sgl-workspace/DeepEP/csrc/deep_ep.cpp:1233
       'x.size(0) == topk_idx.size(0) and x.size(0) <= num_max_dispatch_tokens_per_rank'
   DeepEP low_latency_dispatch's per-rank token cap is exceeded by the
   cuda-graph-max-bs we configured.

Both failures are in the DeepEP path. Per upstream sgl-project/sglang
(server_args.py), moe_a2a_backend defaults to 'none', which uses
all-reduce/all-gather dispatch and lets TP shard the expert weights
across ranks (no separate EP needed). NVIDIA/srt-slurm PR #75 (the
only upstream DSV4 sglang disagg recipe) takes the same TP-only stance
— pure tensor-parallel-size: N with no enable-dp-attention, no
moe-a2a-backend deepep, no dp-size, no ep-size.

Drop those five fields from all 6 recipes. Topology shape preserved:
- 1k1k 1p1d: P TP=8 / D TP=8 (4 nodes)
- 1k1k 1p1d-wide: P TP=8 / D TP=16 (6 nodes)
- 1k1k 3p1d-wide: P 3*TP=8 / D TP=16 (10 nodes)
- 8k1k 1p1d: P TP=8 / D TP=8 (4 nodes)
- 8k1k 3p1d-wide: P 3*TP=8 / D TP=16 (10 nodes)
- 8k1k 7p1d-wide: P 7*TP=8 / D TP=16 (18 nodes)

DSV4-Pro at MXFP4 (~340 GB) shards comfortably under TP=8 (~42 GB/rank)
or TP=16 (~21 GB/rank) with mem-fraction-static: 0.82 leaving plenty of
KV cache headroom on each 96 GB GB200 GPU.

Topology filenames retain the 'dep8' / 'dep16' historical names from
the vLLM PR #1129 sibling for symmetry — the actual sglang_config is
TP-only.
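
For reference, a sketch of the per-block change, with the dropped fields shown as comments (values illustrative, nesting abridged):

  sglang_config:
    tensor-parallel-size: 8        # kept; 16 on the -wide decode workers
    # removed:
    #   enable-dp-attention
    #   moe-a2a-backend: "deepep"
    #   deepep-mode
    #   dp-size
    #   ep-size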

* Add moe-dense-tp-size: 1 — fix shared-experts FP8 block-quant divisibility at TP=8/16

After the DeepEP removal, model load crashed at:

  File '.../sglang/srt/layers/quantization/fp8.py', line 282, in validate_block_quant_shapes
    raise ValueError(
  ValueError: Weight output_partition_size = 192 is not divisible
              by weight quantization block_n = 128.

DSV4-Pro's shared-experts gate_up_proj (intermediate dim ~1536) is
FP8-quantized in 128-element blocks. With TP=8 the per-rank slice is 1536/8=192,
which fails the divisibility check. PR #75 sidesteps this by using
TP=4 (1536/4=384), but that locks us into single-node workers.

sglang's --moe-dense-tp-size flag is the documented workaround
(server_args.py: 'useful when, with large TP size, there are errors
caused by weights in MLP layers having dimension smaller than the
min dimension GEMM supports'). Setting moe-dense-tp-size: 1 runs the
shared / dense-MLP layers replicated across ranks (TP=1) while the
rest of the model — attention, routed experts — keeps TP=8/16. Memory
cost is small since shared experts are a fraction of total weights.

Applied to all 6 recipes; topology/node counts unchanged.
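
Sketch of the intended per-block delta (nesting abridged):

  sglang_config:
    tensor-parallel-size: 8    # 16 on the -wide decode workers
    moe-dense-tp-size: 1       # shared-experts / dense MLP replicated at TP=1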

* Set SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024 in all env blocks

Belt-and-suspenders for the DeepEP per-rank dispatch buffer cap. The
default is too low; with this set we'll have headroom if EP / DeepEP
is re-enabled later (e.g., once the fork's mxfp4_deepseek dispatch API
mismatch is fixed). 1024 matches the cookbook's B200 decode reference.
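
Sketch of the env-block entry, assuming the recipes carry a plain key/value env mapping per worker block:

  env:
    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024"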

* Switch to TP=4 single-node — match PR #75 verbatim, fix FP8 block-quant

The run after adding moe-dense-tp-size: 1 still hit:
  ValueError: Weight output_partition_size = 192 is not divisible
              by weight quantization block_n = 128.

Verified in upstream sglang dp_attention.py (compute_dp_attention_local_info):
  if not enable_dp_attention:
      return tp_rank, tp_size, 0   # moe_dense_tp_size IGNORED
The flag is only honored when enable_dp_attention=True. Since we
already dropped DP-attention to avoid the fork's mxfp4_deepseek bug,
moe-dense-tp-size: 1 was a no-op.

Two valid paths:
  (a) re-enable DP-attention without DeepEP — speculative, never tested
  (b) drop to TP=4 — 1536/4=384 divides cleanly by 128, FP8 quant
      passes. Matches NVIDIA/srt-slurm PR #75 (the only verified-
      working DSV4 sglang disagg recipe upstream) verbatim.

Going with (b). Recipes drop moe-dense-tp-size (no longer needed at
TP=4) and switch tensor-parallel-size to 4 in both prefill+decode.
gpus_per_prefill / gpus_per_decode drop to 4 (single GB200 node per
worker). prefill_nodes / decode_nodes track worker counts.

Topology shape (filenames keep historical dep8/dep16 naming for
symmetry with the vLLM #1129 sibling; actual config is TP=4):
  - 1k1k 1p1d-tep8:    P TP=4 / D TP=4 (2 nodes total)
  - 1k1k 1p1d-dep16:   P TP=4 / D TP=4 (2 nodes total) — same shape, different conc
  - 1k1k 3p1d-dep16:   P 3*TP=4 / D TP=4 (4 nodes)
  - 8k1k 1p1d-tep8:    P TP=4 / D TP=4 (2 nodes)
  - 8k1k 3p1d-dep16:   P 3*TP=4 / D TP=4 (4 nodes)
  - 8k1k 7p1d-dep16:   P 7*TP=4 / D TP=4 (8 nodes)

nvidia-master.yaml updated to match (tp: 4, ep: 1, dp-attn: false on
every prefill+decode block — including the commented 8k/1k block).

Also bumped SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK 1024 → 2048
in all env blocks (DeepEP path is dormant in this config but the env
var is in place for re-enabling later).
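
Sketch of the per-recipe shape after this change (field names from this message; their exact placement in the recipe is abridged):

  sglang_config:
    tensor-parallel-size: 4    # 1536/4 = 384, divisible by block_n = 128
  gpus_per_prefill: 4          # one GB200 node per worker
  gpus_per_decode: 4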

* Restore mi355x retry changelog entries clobbered by merge

The merge of main into this branch (c0aec93) accidentally overwrote
the two dsv4-fp8-mi355x-sglang retry entries (PR #1148 retry-pair tail
and PR #1159 retry-pair) with duplicated copies of our own
dsv4-fp4-gb200-dynamo-sglang entry. The process_changelog.py gate
rejects deletions, so the workflow blocked.

Restore the two mi355x entries verbatim from origin/main and keep a
single copy of our dsv4 entry, appended after the restored mi355x
block. perf-changelog.yaml diff vs origin/main is now additions-only.

* Switch back to TP=8: enable-dp-attention + moe-dense-tp-size: 1, no moe-a2a-backend

TP=4 OOMed — DSV4-Pro at MXFP4 doesn't fit on a single GB200 node.
Need TP=8 across 2 nodes (768 GB total).

But TP=8 trips two issues that earlier rounds papered over:
  a) shared-experts gate_up_proj FP8 block-quant divisibility
     (1536/8=192, not a multiple of block_n=128)
  b) the lmsysorg/sglang:deepseek-v4-grace-blackwell fork's
     mxfp4_deepseek kernel crashes on every DeepEP forward path

Single combo that solves both — verified in upstream sglang source:
  * enable-dp-attention: true  +  moe-dense-tp-size: 1
    Runs dense / shared-MLP layers replicated (TP=1) — fixes (a).
    moe-dense-tp-size IS gated on enable_dp_attention=True per
    python/sglang/srt/layers/dp_attention.py
    (compute_dp_attention_local_info ignores it when DP-attn is off).
  * NO moe-a2a-backend set (default 'none')
    Lands the model on forward_normal instead of forward_deepep —
    avoids (b). Verified in deepseek_v2.py:
      _enable_a2a_moe = is_deepep | is_mooncake | is_nixl | is_mori
                       | is_ascend_fuseep | is_flashinfer
    With backend='none' this is False and forward_normal runs.

Recipes: tensor-parallel-size 4 → 8 (both prefill+decode); add
moe-dense-tp-size: 1, enable-dp-attention: true, dp-size: 8 to every
sglang_config block; gpus_per_prefill / gpus_per_decode 4 → 8;
prefill_nodes / decode_nodes scale to workers × 2.

nvidia-master.yaml mirrors: tp 4 → 8, dp-attn false → true on every
prefill+decode block (active 1k/1k + commented 8k/1k). Topology shape
restored to:
  - 1k1k 1p1d-* : 4 nodes (was 2)
  - 1k1k 3p1d-* : 8 nodes (was 4)
  - 8k1k 1p1d-* : 4 nodes (commented)
  - 8k1k 3p1d-* : 8 nodes (commented)
  - 8k1k 7p1d-* : 16 nodes (commented)
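
Putting the combo above into recipe form, a sketch of each prefill/decode block (nesting abridged):

  sglang_config:
    tensor-parallel-size: 8
    enable-dp-attention: true
    dp-size: 8
    moe-dense-tp-size: 1
    # no moe-a2a-backend set -> defaults to 'none', so forward_normal runs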

* Scope sweep to high-conc DeepEP only — temporarily comment 1p1d blocks

Comment out the low-conc (1-64) and mid-conc (128-4096) search-space
entries in nvidia-master.yaml so the sweep iterates only on the high-
conc 3p1d-dep8-dep16 topology. Re-enable DeepEP on that one recipe to
exercise the EP path:

  3p1d-dep8-dep16 prefill+decode:
    + ep-size: 8
    + moe-a2a-backend: "deepep"
    + deepep-mode: low_latency
    (kept enable-dp-attention + moe-dense-tp-size: 1 + tp=8 / dp=8)

Master matrix label updated to ep=8 to reflect the recipe.

Sibling 1p1d recipes on disk are unchanged (still TP=8 + DP-attn,
no DeepEP). They are still referenced by the commented-out master
entries — restore them by uncommenting.

* tep fix + dep for high conc

* sike no dpa

* Cap SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK at 1024 — sglang LL hard ceiling

DeepEP run (3p1d-dep8-dep16) crashed at:

  File '.../sglang/srt/layers/moe/token_dispatcher/deepep.py', line 325
    assert self.num_max_dispatch_tokens_per_rank <= 1024
  AssertionError

_DeepEPDispatcherImplLowLatency enforces a hard upper bound of 1024 in
low_latency mode. We had bumped the env var to 2048 to give headroom
above the earlier C++ side cap (deep_ep.cpp:1233 'x.size(0) <=
num_max_dispatch_tokens_per_rank'), but 2048 trips this Python-side
assertion at scheduler init. 1024 is exactly the allowed maximum: high
enough to cover the cuda-graph-max-bs we use, low enough to satisfy
the LL dispatcher constructor.

Apply 2048 → 1024 across all 6 recipes (every prefill + decode env
block).

* Revert 3p1d-dep8-dep16 to no-DeepEP TP-only; uncomment full 1k/1k + 8k/1k sweep

DeepEP is broken on the lmsysorg/sglang:deepseek-v4-grace-blackwell
image — verified across three runs (deepep-mode auto/normal,
deepep-mode low_latency, and the latest 3p1d try). All hit the
fork-only mxfp4_deepseek.py:347 reading dispatch_output.topk_output,
which neither DeepEPLLDispatchOutput nor DeepEPNormalDispatchOutput
exposes in this fork. Cannot be fixed from the recipe — needs the
image rebuilt with mxfp4_deepseek patched, or an upstream sglang fix.

3p1d-dep8-dep16 recipe: drop ep-size, moe-a2a-backend, deepep-mode
from prefill+decode. Now matches the 1p1d siblings: TP=8 + DP=8 +
moe-dense-tp-size: 1, default 'none' a2a backend (forward_normal
path bypasses the buggy mxfp4_deepseek kernel).

nvidia-master.yaml:
  * Uncomment the 1k/1k mid-conc and 8k/1k blocks (low + mid + high).
  * 3p1d-dep8-dep16 matrix label ep: 8 → ep: 1 to match recipe.

Sweep now expands to 6 entries / 27 conc points (3 1k/1k + 3 8k/1k).

* Try moe-a2a-backend: flashinfer on 3p1d-dep8-dep16 for high-conc EP

DeepEP is dead in this image (mxfp4_deepseek.py:347 reads
dispatch_output.topk_output, neither DeepEPNormal nor DeepEPLL output
exposes that field). Smoke test the only other plausible EP backend
upstream sglang offers: flashinfer.

Per upstream docs/advanced_features/expert_parallelism.md, flashinfer
is the documented option for 'Large-scale EP deployments' and uses a
different dispatcher than DeepEP — its output class may or may not
trip the same mxfp4_deepseek bug. Per server_args.py _handle_a2a_moe,
flashinfer auto-sets SGLANG_MOE_NVFP4_DISPATCH=True and forces
ep_size = tp_size, so we set ep-size: 8 explicitly. Everything else
(TP=8 / DP=8 / moe-dense-tp-size: 1) stays so the FP8 block-quant
path remains valid.

Scope: 1k/1k 3p1d-dep8-dep16 only. If the EP path serves on this
image, port back to the 1p1d siblings; if it crashes the same way
DeepEP did, revert to the no-EP forward_normal path and accept the
TP-only pareto.

nvidia-master.yaml matrix labels for the 3p1d entry updated to ep=8
to match the recipe.
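
Sketch of the smoke-test delta on the 3p1d recipe (other fields unchanged):

  sglang_config:
    moe-a2a-backend: "flashinfer"
    ep-size: 8                 # flashinfer forces ep_size = tp_size, so set it explicitly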

* Revert flashinfer EP attempt — accept TP-only pareto, every EP backend dead on this image

flashinfer EP smoke test (3p1d-dep8-dep16 1k/1k) crashed at startup:

  File '.../sglang/srt/server_args.py', line 2133, in _handle_a2a_moe
    assert self.moe_runner_backend in [...]
  AssertionError: Flashinfer MoE A2A is only supported with
                  flashinfer_cutlass moe runner backend

flashinfer_cutlass is FP8-only — won't load DSV4-Pro's MXFP4 weights.
The only path that satisfies the assertion would also fail at model
load. So flashinfer is unusable for DSV4 on any image that doesn't
ship a flashinfer_mxfp4_cutlass runner (which doesn't exist).

Combined with the earlier deepep failure (mxfp4_deepseek.py:347
AttributeError on dispatch_output.topk_output, both Normal and LL
dispatch types), every EP backend sglang exposes in this image is
dead. Remaining options (mooncake, nixl-ep, mori, ascend_fuseep) are
either Ascend-NPU-only or not wired into this image.

Revert 3p1d-dep8-dep16 recipe to no-EP TP-only (matches the 5 sibling
recipes) and master.yaml matrix labels (ep: 8 → ep: 1).

PR description's Known Issues section updated to a 4-row table
covering every EP backend tried and accepted as a dead end.

* fix(sglang): bump 8k1k prefill max-running-requests from 4 to 8

sglang computes per-rank capacity as max_running_requests // dp_size.
With dp-size=8, a value of 4 floors to 0, hitting the
"max_running_request is zero" assertion in tp_worker.py:277.
Bump to 8 so each DP rank gets at least 1 slot — matches the
working 1p1d recipe.
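
Sketch of the fix, with the floor-division that motivates it shown as a comment:

  sglang_config:
    dp-size: 8
    max-running-requests: 8    # per-rank slots = 8 // 8 = 1; the old value 4 // 8 floored to 0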

* ports

* Dsv4 fp4 gb200 dynamo sglang disagg (#1213)

* Modify deepseek-v4 configuration for new model settings

* Update YAML configuration for deepseek model

* adapt for model path, etc

* dev

* upd

* fix

* fix

* test

* add gb300

* upd

* fix

* fix

* fix

* fix(launch_gb300-cw): register deepseek-v4-pro alias in model_paths

After fixing the recipe overlay path in 1b07108, srtctl now loads our
hand-rolled SGLang recipe and runs preflight, which rejects:

    Error: Preflight failed for recipes/sglang/.../disagg-gb300-2p1d-dep4-dep8.yaml:
    - model.path: Model 'deepseek-v4-pro' is not a local model path and
      is not defined in srtslurm.yaml model_paths.

Both `disagg-gb300-2p1d-dep4-dep8.yaml` and `disagg-gb300-7p1d-dep4-dep8.yaml`
declare `model.path: deepseek-v4-pro` (per the recipe header comment, the
alias is intentionally aligned with `launch_gb200-nv.sh`'s srtslurm.yaml,
which exports `SRT_SLURM_MODEL_PREFIX=deepseek-v4-pro`). The gb300-cw
launcher only registered `dspro` and `dsv4-pro`, so the alias never
resolved. Add `deepseek-v4-pro` mapping to the same `${MODEL_PATH}`.
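
A sketch of the srtslurm.yaml mapping after the fix; the ${MODEL_PATH} expansion is handled by the launcher script, so the literal value here is illustrative:

  model_paths:
    dspro: ${MODEL_PATH}
    dsv4-pro: ${MODEL_PATH}
    deepseek-v4-pro: ${MODEL_PATH}   # newly registered alias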

* fix(launch_gb300-cw): pull arm64 squash and force fresh import per runner

After fixing model.path alias (fe6815c), the slurm orchestrator reached
the head infrastructure srun and crashed at:

    [ERROR] Invalid image format: /mnt/vast/squash_dupe/lmsysorg_sglang_deepseek-v4-grace-blackwell.sqsh
    error: pyxis: failed to create container filesystem
    error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1

Three issues:

1. The runner pod that runs `enroot import docker://lmsysorg/sglang:...`
   is x86, so without `--arch` enroot fetches the amd64 manifest. The
   compute nodes (slurm-gb300-138-*) are aarch64 and pyxis there
   rejects the amd64 squash with "Invalid image format". Pass
   `--arch arm64` and tag the cache filename with `_arm64`.

2. `enroot import -o existing.sqsh ...` aborts with
   `[ERROR] File already exists` and leaves the stale file in place,
   so once a half-baked or pre-tag-update squash lands at this path it
   is silently reused on every subsequent CI run. Inspecting
   /mnt/vast/squash_dupe showed an Apr 26 amd64 sqsh shadowing the
   Apr 28 working arm64 sqsh exactly like this. `rm -f` before each
   import forces fresh downloads and picks up Docker tag updates.

3. Scope the squash filename per RUNNER_NAME (gb300-cw_0..3) so that
   the four matrix runners do not race on rm+import of the same shared
   path on /mnt/vast. Cost: ~64 GB on /mnt/vast (4 runners × 16 GB
   per arm64 sqsh) instead of 16 GB shared, which is fine on the
   shared VAST mount.

* fix(launch_gb300-cw): use enroot --arch aarch64, not arm64

enroot 4.0.1's `common::debarch()` accepts kernel-style arch names
(`x86_64`, `aarch64`, `ppc64le`) and emits Docker-style names
(`amd64`, `arm64`, `ppc64le`) on the wire. Passing `--arch arm64` (the
Docker manifest name) trips the function's else branch immediately:

    [ERROR] Unsupported architecture: arm64

Use the kernel name `aarch64` so enroot can map it to docker's `arm64`
manifest internally.

* fix(launch_gb300-cw): use pre-staged arm64 sqsh, drop in-CI enroot import

Even with `--arch aarch64`, `enroot import` from the CI runner pod (x86)
fails when converting the arm64 image:

    [INFO] Converting whiteouts...
    /usr/bin/bash: line 1: /usr/bin/enroot-aufs2ovlfs: Operation not permitted
    (repeated dozens of times, then preflight reports the sqsh as missing)

`enroot-aufs2ovlfs` requires CAP_SYS_ADMIN that the runner pod doesn't
hold, and `lmsysorg/sglang:deepseek-v4-grace-blackwell` is arm64-only,
so the conversion can't be skipped either. Per the documented manual
flow at https://gist.github.com/Fridge003/42c6001e0bb613acf0e411305b8ea780
the import has to be dispatched to an aarch64 GB300 compute node via
`srun`.

Rather than running an extra slurm job per CI invocation just to
prepare the sqsh, point the launcher at the pre-staged arm64 sqsh that
already lives at
`/mnt/vast/squash_dupe/lmsysorg_sglang_deepseek-v4-grace-blackwell_arm64.sqsh`
(refreshed manually via the gist script when the docker tag is bumped).
The matching `nginx_1.27.4_arm64.sqsh` was symlinked alongside.

Add a fast-fail check so a missing pre-staged sqsh produces a clear
error instead of a confusing pyxis "Invalid image format" three steps
later.

* fix(launch_gb300-cw): persist dynamo wheel cache and ulimit preamble

Two follow-up fixes now that CI successfully reaches slurm. The first
addresses the dynamo-from-source step (`dynamo: hash: 9d3c913d…`) being
rebuilt cold on every CI run, taking ~10-20 minutes per matrix job:

1. Cluster-wide dynamo wheel cache. srtctl's
   `_hash_cached_source_install` (`src/srtctl/core/schema.py:912`) is
   already designed to cache hash-pinned builds at
   `/configs/dynamo-wheels/<hash>/{ai_dynamo_runtime-*.whl,dynamo-src.tar.gz,.complete}`
   under flock. The cache only works if `/configs/dynamo-wheels` survives
   between CI runs, but the launcher does `rm -rf srt-slurm` and
   re-clones every time, blowing it away. Mount
   `/mnt/vast/dynamo-wheels-cache` (NFS, shared by every gb300-cw_N
   runner) over `/configs/dynamo-wheels` via srtslurm.yaml
   `default_mounts`, so the cache survives `rm -rf` and is shared
   across all matrix jobs. After the first cold build the warm path
   should drop dynamo install to ~30 s.

2. Cluster-wide bash preamble for ulimits. yangminl's manual setup on
   this cluster (`/mnt/home/yangminl/srt-slurm/srtslurm.yaml`) sets
   `default_bash_preamble: "ulimit -n 1048576 && ulimit -a"` so the
   dynamo frontend / sglang servers can accept the 8192-concurrency
   sweep without `EMFILE: too many open files`. Mirror that here. The
   feature is supported by srtctl's pinned commit
   (`src/srtctl/core/slurm.py:_get_cluster_bash_preamble`).
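
Sketch of the two srtslurm.yaml additions; the key names come from this message, while the mount-entry syntax is an assumption:

  default_mounts:
    - /mnt/vast/dynamo-wheels-cache:/configs/dynamo-wheels
  default_bash_preamble: "ulimit -n 1048576 && ulimit -a"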

* fix(sglang/dsv4/8k1k recipes): set cpus-per-task=144 for dynamo build

slurm assigns 1 CPU/task by default; `scontrol show job <id>` from a
recent CI run shows `NumCPUs=4 NumTasks=4 CPUs/Task=1` with 4 nodes,
i.e. one core per worker. The dynamo `hash:` source install rebuilds
~500 rust crates (kube-client, tonic, hf-hub, image codecs ravif/exr,
pyo3 stack) and at one core takes 30+ min just for the cold build,
which dominates total CI time even with the new
`/configs/dynamo-wheels` cache (the cache only helps after the first
cold run).

Match yangminl's working manual setup
(`/mnt/home/yangminl/srt-slurm/recipes/dsv4-pro/sglang/gb300-fp4/all-dynamo.yaml`)
which sets `sbatch_directives.cpus-per-task: "144"` so cargo gets the
full GB300 host (144 cores) and finishes maturin in a few minutes.

* fix(sglang/dsv4/8k1k recipes): set cpus-per-task=144 and mem=0

slurm assigns 1 CPU/task by default; `scontrol show job 613` from a
running CI job confirmed `NumCPUs=4 NumTasks=4 CPUs/Task=1` with 4
nodes — one core per worker. The dynamo `hash:` cold source install
rebuilds ~500 rust crates (kube-client, tonic, hf-hub, image codecs
ravif/exr, the pyo3 stack) and at one core takes 30+ min just for the
cold build, which dominates total CI time even with the new
`/configs/dynamo-wheels` cache (the cache only helps after the first
cold run).

Match yangminl's working manual setup on the same gb300-cw cluster
(`/mnt/home/yangminl/srt-slurm/recipes/dsv4-pro/sglang/gb300-fp4/all-dynamo.yaml`)
which sets:
  sbatch_directives:
    cpus-per-task: "144"
    mem: "0"

cargo then gets the full 144-core GB300 host and finishes maturin in a
few minutes; mem=0 hands the worker the entire node's RAM so the
dynamo build + DSV4-Pro 671B FP4 weight load fit without OOM.

* fix(launch_gb300-cw): pin srt-slurm fork with parallel sa-bench

The current sa-bench in NVIDIA/srt-slurm@9d75f82 generates random
prompts single-threaded, which dominates 7p1d/conc=8192 bench startup
(~50 min just for the 81920-prompt main pass before the first HTTP
request reaches dynamo). Pin to fzyzcjy/srt-slurm fork branch
`feat/random-num-workers` (commit 8094cfb), which is 9d75f82 + the
SemiAnalysisAI/InferenceX `utils/bench_serving/` benchmark_serving.py
ported into sa-bench. With `--random-num-workers 48` (now the default
in bench.sh) prompt generation drops to ~1 min on a 144-core GB300
host, putting the bench-startup cost on the same order as
infra+model-load instead of dominating it.

The fork is paired with the upstream PR
NVIDIA/srt-slurm#114; once that merges, this
pin should revert to the bumped NVIDIA/srt-slurm SHA.

* fix(launch_gb300-cw): bump srt-slurm fork pin to minimal multiproc patch

Previous pin (8094cfb) was a wholesale replacement of sa-bench with
the SemiAnalysisAI/InferenceX bench_serving — that dropped
`async_request_dynamo_completions` from `ASYNC_REQUEST_FUNCS`, so
`bench.sh` would have died on `--backend dynamo` argparse rejection
the moment the bench client started.

New pin (4249d16) is a tight ~100-line patch on top of
NVIDIA/srt-slurm@9d75f82 that only adds parallel random prompt
generation (`--random-num-workers`); everything else, including the
dynamo backend and `--custom-tokenizer` plumbing, stays exactly the
same as upstream. See NVIDIA/srt-slurm#114.

* ci: temporarily comment out conc-list:[64] 2p1d entry

Focus CI on the conc=8192 7p1d max-throughput entry only — re-enable
the 2p1d/conc=64 mid-curve entry shortly once that's green.

* ci(eval): temporarily skip dsv4-fp4-gb300 dynamo-sglang eval-only entry

The srt-slurm pin (9d75f82, recipes/dsv4-agg-disagg) lacks the lm-eval
orchestrator path that lives on sa-submission-q2-2026. Skip the auto-generated
eval-only matrix entry for this config until the pin is bumped.

TODO: remove this branch once the pin is moved to sa-submission-q2-2026 (which
already carries the EVAL_ONLY do_sweep.py branch and lm-eval/bench.sh).

* bench(7p1d-dep4-dep8): swap sa-bench default for yangminl's gb300-cw recipe

Replace the sa-bench builder (concurrencies=8192, req_rate=inf, sa-bench
default num_prompts/num_warmups multipliers) with the exact custom
command from yangminl's gb300-cw 8k1k_hightpt[0] run (slurm job 564 on
the dsv4-pro-gb300-fp4 cluster):

  concurrency=4096, rate=48, num_prompts=40960, num_warmups=512,
  random_num_workers=96.

Why mirror those exact knobs: that recipe is what produced the 7p1d
reference numbers we benchmarked against (358K total tok/s, 39.9K output
tok/s, ~5s mean TTFT). Running sa-bench at concurrency=8192/rate=inf
will saturate the 1-decode-worker GPU (we observed 16384 concurrency on
job 617 saturated decode at ~390 running/rank with mean TTFT ~257s,
i.e. equilibrium gated by decode compute, not the bench), making the
result not directly comparable.

Bench framework note: the fzyzcjy fork's benchmark_serving.py /
benchmark_utils.py / encoding_dsv4.py are byte-identical to upstream
SemiAnalysisAI/InferenceX/main; only backend_request_func.py adds five
per-request debug print sites (ok=/lat=/url=/plen=/err=). Throughput
numbers should match sa-bench at the same flags; the fork is chosen
here to keep parity with the reference run's logs.

Skipped on purpose:
- DeepGEMM env knobs (SGLANG_DG_CACHE_DIR / SGLANG_JIT_DEEPGEMM_PRECOMPILE
  vs SGLANG_JIT_DEEPGEMM_FAST_WARMUP=1) — yangminl's cache dir is
  /configs/deepgemm_cache on the gb300-cw host and isn't portable here;
  PR's FAST_WARMUP path stays.
- expert_location_dispatch.py topk_ids int32 cast (yangminl commits
  94b7dc4c7 + e933ef2b1 on the patched sglang fork) — not pulling that
  into the container build.

* config(7p1d-dep4-dep8): align with job 564 — multi-frontend, sbatch dirs, name

Eliminate every non-cluster-specific diff vs job 564's resolved config
(`/outputs/564/config_8k1k_hightpt_0.yaml`):

- name: match `dsv4-pro-gb300-fp4_8k1k_hightpt_0` (was stale gb200 string)
- frontend.enable_multiple_frontends: false → true; add num_additional_frontends: 8
  (job 564 ran 9 dynamo frontends behind nginx; PR was running a single
  frontend, which was a real router-side runtime diff)
- slurm.time_limit: 8h → 3h to match job 564
- sbatch_directives.cpus-per-task: 144, mem: 0 (portable, was missing)
- drop health_check block (job 564 doesn't set it; rely on srtctl default)

Remaining diffs vs job 564 are all either cluster-specific path bindings
(slurm.partition=hpc-mid, frontend.nginx_container, extra_mount of
yangminl's patched sglang) or DG-cache env (SGLANG_DG_CACHE_DIR /
SGLANG_JIT_DEEPGEMM_PRECOMPILE) — those need InferenceX-cluster-side
equivalents and are documented in the header comment.
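
Sketch of the aligned fields (nesting abridged; the time-limit format written here is illustrative):

  frontend:
    enable_multiple_frontends: true
    num_additional_frontends: 8    # nine dynamo frontends total behind nginx
  slurm:
    time_limit: 3h                 # was 8h
  sbatch_directives:
    cpus-per-task: "144"
    mem: "0"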

* config(7p1d-dep4-dep8): keep PR name field, revert to original

* upd

* fix

* fix

* middle

* fi

* fix

* upd

* fix

* upd

---------

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Cheng Wan <chwan@rice.edu>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>