
Day 0 DeepSeek V4 Pro FP4 GB200 disaggregated vLLM benchmarks #1129

Merged
Oseltamivir merged 33 commits into main from dsv4-fp4-gb200-dynamo-vllm-disagg on Apr 25, 2026

Conversation

@Oseltamivir (Collaborator) commented Apr 24, 2026

Summary

Adds dsv4-fp4-gb200-dynamo-vllm for DeepSeek-V4-Pro on GB200 (Dynamo + vLLM disagg). Currently runs only the 1k/1k sweep — the 8k/1k block sits commented out in nvidia-master.yaml to keep sweep-enabled runtime bounded; uncomment to re-enable.

Active sweep

| Topology | Conc | Nodes | Source |
| --- | --- | --- | --- |
| 1p1d-dep8-tep8 | 1, 4, 8, 16, 32, 64 | 4 | Mirrored from NVIDIA/srt-slurm PR #71 (branch aflowers/gb200-dsv4-recipes, file recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-1p1d-dep8-tep8.yaml). Local deltas: numa-bind removed (our srt-slurm clone doesn't ship the vllm_numa_bind_hash_fix.py patch); benchmark.tokenizer_mode + use_chat_template: true dropped (they need the PR #68 sa-bench tokenizer support that our pinned srtctl version doesn't have). CPU/DRAM offload kept — load-bearing (without it prefill OOMs with "Available KV cache memory: -16 GiB"). |
| 1p1d-dep8-dep16 | 256, 512, 1024, 2048, 3072, 4096 | 6 | Hand-rolled. No DSV4-Pro vLLM disagg precedent at 1k/1k upstream; structure follows kimi-k2.5/1k1k/disagg-gb200-1p1d-dep4-dep16.yaml scaled to DSV4-Pro's DP>=8 minimum. |
| 3p1d-dep8-dep16 | 4096, 8192 | 10 | Hand-rolled. Adds prefill capacity for the high-conc tail (a single prefill worker saturates around conc 4096 at 1k prompts). The 4096 overlap with the 1p1d-dep16 entry gives a direct A/B at the topology-crossover point. |

11 benchmark points across 3 cluster startups for 1k/1k. The commented 8k/1k block has 3 corresponding entries ready when re-enabled.

Files

  • .github/configs/nvidia-master.yaml — new sweep config keyed dsv4-fp4-gb200-dynamo-vllm
  • benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/{1k1k,8k1k}/*.yaml — recipe YAMLs (overlaid onto the upstream srt-slurm checkout at runtime)
  • runners/launch_gb200-nv.sh — dsv4 model-prefix branch + recipe-overlay step (cp -rT so an upstream stub directory wouldn't nest under ours, per claude bot review)
  • perf-changelog.yaml — sweep changelog entry

Recipe-reminder bot response

  • Low-conc TEP recipe is byte-for-byte mirrored from NVIDIA/srt-slurm PR #71 (branch aflowers/gb200-dsv4-recipes). The two local deltas (numa-bind off; sa-bench tokenizer fields off) are environment-driven, not optimization-driven.
  • Mid/high-throughput recipes are hand-rolled because no upstream DSV4-Pro vLLM disagg recipe exists at 1k/1k; structure tracks kimi-k2.5/1k1k/*.yaml from the same upstream branch. Each recipe header explicitly cites its NVIDIA reference and lists the deltas.

Test plan

  • python3 utils/matrix_logic/generate_sweep_configs.py test-config --config-keys dsv4-fp4-gb200-dynamo-vllm expands to 3 entries / 11 conc points
  • srtctl SrtConfig.from_yaml(...) validates all four recipe YAMLs
  • bash -n runners/launch_gb200-nv.sh passes
  • perf-changelog.yaml diff vs main is additions-only (gate passes)
  • Low-conc 1p1d-dep8-tep8 reproduces the upstream OOM symptom when offload is dropped, and clears it when offload is re-added — the offload knobs are load-bearing for this topology
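
For convenience, here is a minimal shell sketch of how the expansion, syntax, and changelog checks above can be reproduced locally. The git diff form of the additions-only check is an assumption; the project's actual gate lives in process_changelog.py.

# Hedged local reproduction of part of the test plan above
python3 utils/matrix_logic/generate_sweep_configs.py test-config \
  --config-keys dsv4-fp4-gb200-dynamo-vllm   # expect 3 entries / 11 conc points

bash -n runners/launch_gb200-nv.sh           # launcher syntax check only

# Rough stand-in for the additions-only changelog gate (assumed equivalent)
test "$(git diff origin/main -- perf-changelog.yaml | grep -c '^-[^-]')" -eq 0 \
  && echo "additions-only" || echo "deletions present"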

Adapts the kimi-k2.5 dynamo-vllm 8k1k 5p1d-dep4-dep8 recipe to DeepSeek V4
Flash. Recipes live under srt-slurm-recipes/ and are copied into the
srt-slurm checkout at runtime since the upstream NVIDIA/srt-slurm repo
doesn't ship DSV4 recipes.

- New config key: dsv4-fp4-gb200-dynamo-vllm
- Image: vllm/vllm-openai:deepseekv4-cu130
- Model: deepseek-ai/DeepSeek-V4-Flash
- Model path on cluster: /mnt/lustre01/users/sa-shared/DeepSeek-V4-Flash
@github-actions (Contributor)

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@claude (bot, Contributor) left a comment


Straightforward benchmark config addition following the established kimik2.5 pattern; the inline nit about cp -r semantics is non-blocking defensive hardening.

Extended reasoning...

Overview

This PR adds a new dsv4-fp4-gb200-dynamo-vllm benchmark configuration for DeepSeek V4 Flash on GB200 via Dynamo + vLLM. It touches four files: a new entry in .github/configs/nvidia-master.yaml, a changelog entry in perf-changelog.yaml, a new dsv4 branch in runners/launch_gb200-nv.sh (mirroring the existing kimik2.5 pattern for both model-path resolution and srt-slurm checkout), and a new recipe YAML at srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml.

Security risks

None. This is benchmark-infrastructure config — no auth, crypto, permissions, or user-input handling. The shell additions use already-exported env vars consistent with surrounding code, no new injection surfaces.

Level of scrutiny

Low. This is a self-contained benchmark recipe addition that follows an established pattern (compare against the kimik2.5 dynamo-vllm and dynamo-trt branches already in the same script). The new shell branch is a near-clone of the kimik2.5 dynamo-trt arm; the YAML recipe is a parameter-tuned adaptation of the existing kimi-k2.5 5p1d-dep4-dep8 recipe; the changelog entry matches the project's format.

Other factors

The one inline bug is a nit about cp -r src dst semantics flipping if upstream ever adds a recipes/vllm/deepseek-v4/ directory to the pinned sa-submission-q2-2026 branch — defensive hardening worth doing but not blocking, since today's behavior is correct and the trigger is hypothetical. The recipe-reminder bot comment is the standard automated nudge, not a blocking review.

Comment thread: runners/launch_gb200-nv.sh (Outdated)
Comment on lines +140 to +144
if [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "dsv4" ]]; then
git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
cd "$SRT_REPO_DIR"
git checkout sa-submission-q2-2026
cp -r "$GITHUB_WORKSPACE/srt-slurm-recipes/vllm/deepseek-v4" recipes/vllm/deepseek-v4

🟡 The cp -r "$GITHUB_WORKSPACE/srt-slurm-recipes/vllm/deepseek-v4" recipes/vllm/deepseek-v4 works today only because the upstream sa-submission-q2-2026 branch does not contain a recipes/vllm/deepseek-v4/ directory. If upstream ever adds even a stub deepseek-v4 dir, cp -r src dst will copy into the existing dst, producing recipes/vllm/deepseek-v4/deepseek-v4/8k1k/..., and CONFIG_FILE in nvidia-master.yaml will silently resolve to the upstream stub instead of the InferenceX-controlled recipe. Trivial fix: cp -r src/. dst/, cp -rT src dst, or rm -rf dst && cp -r src dst.

Extended reasoning...

What the bug is

runners/launch_gb200-nv.sh:140-144 copies the InferenceX-owned recipe into the freshly cloned srt-slurm checkout:

git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
cd "$SRT_REPO_DIR"
git checkout sa-submission-q2-2026
cp -r "$GITHUB_WORKSPACE/srt-slurm-recipes/vllm/deepseek-v4" recipes/vllm/deepseek-v4

cp -r src dst has dual semantics:

  • if dst does not exist → dst is created as a copy of src (file lands at dst/8k1k/...)
  • if dst does exist as a directory → src is copied into dst (file lands at dst/deepseek-v4/8k1k/...)

Today the copy works because the pinned upstream branch does not contain recipes/vllm/deepseek-v4/.
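
A throwaway shell demonstration of that flip, using hypothetical scratch directories rather than the real repo paths:

mkdir -p src/8k1k     # stand-in for srt-slurm-recipes/vllm/deepseek-v4
cp -r src dst         # dst absent: created as a copy, file lands at dst/8k1k/
mkdir -p dst2         # simulate upstream already shipping the directory
cp -r src dst2        # dst2 exists: src nests inside, landing at dst2/src/8k1k/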

Step-by-step proof

  1. PR description states "The upstream NVIDIA/srt-slurm (sa-submission-q2-2026 branch) doesn't ship DSV4 recipes" — the maintainers explicitly acknowledge upstream may eventually add them.
  2. Suppose upstream adds even a stub recipes/vllm/deepseek-v4/README.md to that branch (or any other content under that path).
  3. After git checkout sa-submission-q2-2026, the directory recipes/vllm/deepseek-v4/ now exists in the working tree.
  4. cp -r "$GITHUB_WORKSPACE/srt-slurm-recipes/vllm/deepseek-v4" recipes/vllm/deepseek-v4 then copies the source directory itself into the existing destination, producing:
    recipes/vllm/deepseek-v4/deepseek-v4/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml
    
  5. nvidia-master.yaml declares CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml. srtctl apply reads that path, finds either nothing (the InferenceX recipe is now buried one level too deep) or whatever upstream stub happens to live at that location.
  6. The benchmark either fails noisily (best case) or silently runs an upstream/incomplete recipe (worst case), and the InferenceX-controlled YAML in this PR becomes dead code.

Why existing code does not prevent this

The script does rm -rf "$SRT_REPO_DIR" before cloning, which removes any previous run's state — but it does not protect against fresh upstream content arriving via git checkout. The branch is pinned by name, not by SHA, so it remains a moving target. A single upstream commit on sa-submission-q2-2026 adding any content under recipes/vllm/deepseek-v4/ is sufficient to flip the cp semantics.

Addressing the refutation

The refutation argues this is purely hypothetical and that any upstream stale recipe would itself surface as a recipe-conflict that breaks the run loudly. Two responses:

  1. The trigger is plausible and explicitly anticipated by the PR description ("upstream doesn't ship DSV4 recipes" implies it could). The companion kimik2.5-fp4-gb200-dynamo-vllm PR also pulls from the same branch, so the project is already coupled to upstream evolution there.
  2. "Breaks loudly" is not guaranteed — if upstream adds a directory at that path without a YAML at 8k1k/disagg-gb200-5p1d-dep4-dep8.yaml, srtctl will get a missing-file error (loud), but if upstream adds any YAML at that exact path (which is the normal pattern as DSV4 support is upstreamed), srtctl will silently consume the stale upstream content and the benchmark will run with the wrong configuration. The InferenceX-controlled recipe will be silently shadowed.

Fix

Any of the following make the copy deterministic regardless of upstream state:

# Option 1: copy contents into dst (creates dst if missing, merges otherwise)
cp -r "$GITHUB_WORKSPACE/srt-slurm-recipes/vllm/deepseek-v4/." recipes/vllm/deepseek-v4/

# Option 2: GNU coreutils -T flag treats dst as the final name
cp -rT "$GITHUB_WORKSPACE/srt-slurm-recipes/vllm/deepseek-v4" recipes/vllm/deepseek-v4

# Option 3: nuke then copy
rm -rf recipes/vllm/deepseek-v4 && cp -r "$GITHUB_WORKSPACE/srt-slurm-recipes/vllm/deepseek-v4" recipes/vllm/deepseek-v4

This is defensive hardening — today's behavior is correct — so flagging as nit rather than blocking.

@Oseltamivir Oseltamivir changed the title Add DeepSeek V4 Flash FP4 GB200 disaggregated vLLM benchmarks Add DeepSeek V4 Pro FP4 GB200 disaggregated vLLM benchmarks Apr 24, 2026
@functionstackx functionstackx changed the title Add DeepSeek V4 Pro FP4 GB200 disaggregated vLLM benchmarks [Pipe Cleaning PR - while vllm maintainers] Add DeepSeek V4 Pro FP4 GB200 disaggregated vLLM benchmarks Apr 24, 2026
Adds two 1k/1k vLLM disagg recipes extrapolated from kimi-k2.5/1k1k
(scaled to DSV4-Pro's DP>=8-per-worker constraint):
  * disagg-gb200-1p4d-dep8-dep8.yaml — interactivity (conc 4-128), 10 nodes
  * disagg-gb200-1p1d-dep8-dep16.yaml — mid/high throughput
    (conc 256-4096), 6 nodes

Per-recipe tuning vs our 8k/1k baseline:
  * max-model-len 3072 (matches kimi 1k/1k)
  * prefill max-num-seqs 16 (fills 16384-token budget at 1k per seq)
  * decode max-num-seqs 128/512 (shorter KV -> more parallelism)

nvidia-master.yaml changes:
  * Adds the 1k/1k seq-len-config with conc-lists stripped of 4/16/32
  * Comments out the entire 8k/1k block so sweep-enabled runs don't
    re-trigger 8k/1k while 1k/1k numbers are collected. Re-enable by
    uncommenting (instructions at the top of the block).
Previous run reported "Model did not get healthy in 1800 seconds" on the
1k/1k 1p4d-dep8-dep8 recipe despite health_check.max_attempts being set
to 720. 1800s is the srtctl default, so our override either wasn't
applied or wasn't enough in the face of a cold-cache Lustre load.

Double-down:
  * health_check.max_attempts: 720 -> 1440 (1800s -> 14400s = 4 hours)
  * slurm.time_limit: 8:00:00 explicit (srtslurm.yaml default is 6h,
    make it even wider so the SLURM wall clock can't cut off a slow load)

Applied to all five recipes (1k/1k x2 and 8k/1k x3) so the fix carries
over when the 8k/1k block in nvidia-master.yaml is re-enabled.
Replaces our hand-rolled 8k/1k DSV4-Pro vLLM disagg recipes with the
four topologies from NVIDIA/srt-slurm PR #71 (source fork:
alec-flowers/srt-slurm, branch aflowers/dsv4-pr67-pr68, pinned at
commit d60e3f1c). PR #71 supersedes PR #67 that our original 8k/1k
recipes were based on, with more topologies, a wider concurrency
sweep per recipe, new env vars, explicit tokenizer-mode, and CPU/DRAM
expert offload.

We take everything except offload:

  * launch_gb200-nv.sh clones alec-flowers/srt-slurm for dsv4 instead
    of NVIDIA/srt-slurm.
  * Runtime post-clone patch strips `offload-group-size`,
    `offload-num-in-group`, `offload-prefetch-step`, and the commented
    `# offload-params` line from all four 8k/1k recipes.
  * Same post-clone patch injects our `slurm.time_limit: 8:00:00` and
    `health_check: {max_attempts: 1440, interval_seconds: 10}` (4 h
    budget) so the recipes match our cold-cache Lustre load budget.
  * Model-path alias changed from `deepseek-v4-pro` to `deepseekv4-fp4`
    to match PR #71 recipes' `model.path` field; 1k/1k local recipes
    updated to the same alias.
  * nvidia-master.yaml 8k/1k block rewritten: 4 search-space entries
    (1p1d-dep8-dep8, 3p1d-dep8-dep8, 3p1d-dep8-dep16, 6p1d-dep8-dep16),
    each running conc list [4, 8, 16, 32, 64, 256, 512, 1024] — 32 total
    8k/1k benchmark points across 4 cluster startups.
  * Obsolete local 8k/1k recipes under srt-slurm-recipes/vllm/deepseek-v4/8k1k/
    removed (superseded by the PR #71 upstream files).

1k/1k sweep is unchanged otherwise (2 matrix entries, 9 benchmark
points using the hand-rolled recipes — no PR #71 equivalent at 1k/1k).
The existing 1k/1k 1p1d-dep8-dep16 recipe runs out of prefill at
conc>=8192 — single DP=8 prefill worker can sustain ~80-150K tok/s,
not the ~200-300K tok/s of demand at conc=8192. New 3p1d-dep8-dep16
recipe adds 2 more prefill workers (10 nodes total).

Decode capacity bumped to max-num-seqs=1024 (vs 512 in 1p1d) so
conc=8192 has headroom (per-rank 8192/16 = 512, well below 1024).
max-cudagraph-capture-size kept at 512 — steady-state per-rank batch
is ~512 so cudagraphs still apply.

conc-list overlap at 4096 between the two topologies gives a direct
crossover comparison point.
Recipes are part of the multi-node benchmark plumbing — they belong
next to the other multi-node assets (amd_utils/, dsr1_*_sglang-disagg.sh,
gptoss_fp4_gb200_dynamo-trt.sh) rather than at the repo root.

Updates the launch script's `cp -r` source path. The reference in
perf-changelog.yaml's historical entry is left untouched (additions-only
gate; it's only a description string).
Oseltamivir and others added 3 commits April 24, 2026 16:17
Decode workers use TP=8 within each worker (no data-parallel decode),
sheds attention-layer memory pressure compared to the dep8-dep8 sibling
at the cost of an inter-rank TP all-reduce per attention layer.

Each rank holds:
  * dep8 sibling: full attention replica + 1/8 of experts (EP=8)
  * tep8 (this):  1/8 of attention (TP=8 sharded) + 1/8 experts (EP=8)

Same node count (10) and same conc-list as the dep8-dep8 sibling so the
two are directly comparable. Useful at low concurrency where TP
all-reduce overhead is a smaller fraction of step time.

Topology pattern derived from kimi-k2.5/{1k1k,8k1k}/disagg-gb200-1p4d-
dep4-tep4.yaml (the only vLLM disagg TEP precedent on GB200 in upstream
srt-slurm). Scaled to TP=8 because DSV4-Pro's attention layers don't fit
the per-rank budget at TP=4.

nvidia-master.yaml:
  * Adds the 1k/1k TEP entry as a sibling to the existing dep8-dep8 entry
    (same conc-list [8, 64, 128], active).
  * Adds the 8k/1k TEP entry inside the still-commented 8k/1k block
    (conc-list [8, 128]) so it's present when 8k/1k is re-enabled.
Oseltamivir and others added 5 commits April 24, 2026 16:36
Reverts the experimental TEP-decode variant for low concurrency. Removes
both 1k/1k and 8k/1k recipe files plus the active 1k/1k search-space
entry and the (still-commented) 8k/1k entry in nvidia-master.yaml.
Reverts the 'Interactivity (DP-decode)' / 'Interactivity (TEP-decode)'
naming back to plain 'Interactivity' on the dep8-dep8 entries.
Mirrors the NVIDIA-official TEP recipe for very low concurrency:

  https://github.com/NVIDIA/srt-slurm/blob/aflowers/gb200-dsv4-recipes/
    recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-1p1d-dep8-tep8.yaml

Topology: 1 prefill (DP=8) + 1 decode (TP=8) — 4 nodes. Adds 1k/1k
sibling (no upstream equivalent) by shrinking max-model-len to 3072.

Local deviations from upstream (documented in recipe headers):
  * model.path renamed deepseekv4-fp4 -> deepseek-v4-pro to match our
    launch script's SRT_SLURM_MODEL_PREFIX.
  * Stripped CPU/DRAM offload knobs and numa-bind (our pinned
    NVIDIA/srt-slurm@sa-submission-q2-2026 clone doesn't ship the
    vllm_numa_bind_hash_fix.py patch upstream uses).
  * benchmark.use_chat_template: false (no PR #68 sa-bench changes in
    our srtctl); benchmark.tokenizer_mode dropped for the same reason.
  * Container kept on the floating tag; health_check + slurm.time_limit
    added for cold-cache Lustre loads.

Replaces the 1p4d-dep8-dep8 low-conc entries (10-node, 4 decode workers)
with this 4-node TEP topology in both 1k/1k (active) and 8k/1k (still
commented). Deletes the now-unused 1p4d-dep8-dep8 recipe files.

Active 1k/1k sweep: 3 entries / 14 benchmark points.
@Oseltamivir Oseltamivir changed the title [Pipe Cleaning PR - while vllm maintainers] Add DeepSeek V4 Pro FP4 GB200 disaggregated vLLM benchmarks Day 0 DeepSeek V4 Pro FP4 GB200 disaggregated vLLM benchmarks Apr 25, 2026
Last run failed with "Available KV cache memory: -15.99 GiB" on every
prefill rank — model weights + activations alone exceed the
gpu-memory-utilization=0.8 budget by ~16 GB at DP=8 (full attention
replicated per rank + 1/8 of FP4 experts). The upstream recipe ships
with offload precisely to free that ~16 GB by spilling MoE expert
weights to host DRAM.

Restores the three offload knobs on prefill in both 1k/1k and 8k/1k:
  offload-group-size: 3
  offload-num-in-group: 1
  offload-prefetch-step: 2

numa-bind: true is still excluded — needs the
configs/patches/vllm_numa_bind_hash_fix.py patch that our pinned
NVIDIA/srt-slurm@sa-submission-q2-2026 clone doesn't ship. Offload
works without it (just slower host-side bandwidth).
@Oseltamivir (Collaborator, Author) commented Apr 25, 2026

Current vllm/vllm-openai:deepseekv4-cu130 image index digest: sha256:2e05966d05579729137714d16035f0cb3b9f0fc1586fbd009753868ee9afc68b
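
For anyone re-verifying the pin, one way to read the index digest of the tag above (assumes a Docker client with buildx available):

# Prints the manifest-list (index) digest plus the per-platform manifests
docker buildx imagetools inspect vllm/vllm-openai:deepseekv4-cu130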

Oseltamivir and others added 5 commits April 24, 2026 19:34
* runners/launch_gb200-nv.sh: switch the recipe overlay step from
  `cp -r src dst` to `cp -rT src dst` (with explicit `mkdir -p dst`
  first). Addresses the bot review nit at line 144 — `cp -r src dst`
  works only because the upstream sa-submission-q2-2026 branch has no
  `recipes/vllm/deepseek-v4/` directory today; if upstream ever ships
  one, `cp -r` would nest as `recipes/vllm/deepseek-v4/deepseek-v4/...`
  and CONFIG_FILE in nvidia-master.yaml would silently resolve to the
  upstream stub. `-T` overlays unconditionally.

* perf-changelog.yaml: refresh the dsv4-fp4-gb200-dynamo-vllm entry's
  description. The previous wording referenced "8k1k, 7p1d-dep8-dep16"
  and "Mirrors NVIDIA/srt-slurm PR #67" which is stale after the move
  to a 1k/1k sweep with TEP low-conc (mirrored from PR #71) plus two
  hand-rolled mid/high topologies. Also fixes the directory reference
  (recipes moved to benchmarks/multi_node/srt-slurm-recipes/ during
  the cleanup pass).
When the 8k/1k block was uncommented, every line landed two spaces too
deep — the block became a child of the 1k/1k entry's search-space list
instead of a sibling under seq-len-configs. process_changelog.py's
pydantic check caught this:

  seq-len-configs.0.search-space.3.prefill: Field required
  seq-len-configs.0.search-space.3.isl: Extra inputs are not permitted

(The validator was reading the 8k/1k entry as a 4th search-space item
that lacked prefill/decode and had stray isl/osl fields.)

Dedented the entire 8k/1k block by 2 spaces. Schema validates, matrix
expansion produces 6 entries / 24 benchmark points across 1k/1k + 8k/1k.
…truth)

The workflow only exports CONFIG_FILE to srtctl and doesn't rewrite the
recipe's benchmark.concurrencies block — so what actually runs is
determined by the recipe, while the matrix conc-list only drives job
naming and result aggregation. When the two disagree the matrix labels
end up wrong (some advertised concs never run; runs land under
mismatched labels).

Two mismatches caught by audit:

  1k/1k 1p1d-dep8-dep16:
    matrix [256, 512, 1024, 2048, 3072, 4096]  ->  [128, 256, 1024, 2048, 4096]
    recipe stays 128x256x1024x2048x4096

  8k/1k 7p1d-dep8-dep16:
    matrix [2048, 4096]  ->  [4096, 8192]
    recipe stays 4096x8192

Picked recipe-side as the source of truth so the recipes stay
self-consistent; matrix labels now reflect what srtctl will actually run.
@Oseltamivir Oseltamivir merged commit 3d416ba into main Apr 25, 2026
9 of 16 checks passed
@Oseltamivir Oseltamivir deleted the dsv4-fp4-gb200-dynamo-vllm-disagg branch April 25, 2026 03:45
@claude claude Bot mentioned this pull request Apr 25, 2026
3 tasks
Oseltamivir added a commit that referenced this pull request Apr 25, 2026
…tch types broken

Run after the deepep-mode: low_latency change failed again. Logs show
two distinct DeepEP-path failures:

1. Prefill scheduler crash:
     File '.../sglang/srt/layers/quantization/mxfp4_deepseek.py', line 347
       topk_output = dispatch_output.topk_output
     AttributeError: 'DeepEPLLDispatchOutput' object has no attribute 'topk_output'
   The earlier crash had 'DeepEPNormalDispatchOutput' — neither dispatch
   output type in this image's sglang fork exposes topk_output, so
   forcing low_latency vs normal mode does not help. mxfp4_deepseek.py
   is a fork-only file (does not exist in upstream sgl-project/sglang),
   so the API mismatch can only be fixed by rebuilding the image.

2. Decode CUDA graph capture crash:
     RuntimeError: Failed: Assertion error /sgl-workspace/DeepEP/csrc/deep_ep.cpp:1233
       'x.size(0) == topk_idx.size(0) and x.size(0) <= num_max_dispatch_tokens_per_rank'
   DeepEP low_latency_dispatch's per-rank token cap is exceeded by the
   cuda-graph-max-bs we configured.

Both failures are in the DeepEP path. Per upstream sgl-project/sglang
(server_args.py), moe_a2a_backend defaults to 'none', which uses
all-reduce/all-gather dispatch and lets TP shard the expert weights
across ranks (no separate EP needed). NVIDIA/srt-slurm PR #75 (the
only upstream DSV4 sglang disagg recipe) takes the same TP-only stance
— pure tensor-parallel-size: N with no enable-dp-attention, no
moe-a2a-backend deepep, no dp-size, no ep-size.

Drop those five fields from all 6 recipes. Topology shape preserved:
- 1k1k 1p1d: P TP=8 / D TP=8 (4 nodes)
- 1k1k 1p1d-wide: P TP=8 / D TP=16 (6 nodes)
- 1k1k 3p1d-wide: P 3*TP=8 / D TP=16 (10 nodes)
- 8k1k 1p1d: P TP=8 / D TP=8 (4 nodes)
- 8k1k 3p1d-wide: P 3*TP=8 / D TP=16 (10 nodes)
- 8k1k 7p1d-wide: P 7*TP=8 / D TP=16 (18 nodes)

DSV4-Pro at MXFP4 (~340 GB) shards comfortably under TP=8 (~42 GB/rank)
or TP=16 (~21 GB/rank) with mem-fraction-static: 0.82 leaving plenty of
KV cache headroom on each 96 GB GB200 GPU.

Topology filenames retain the 'dep8' / 'dep16' historical names from
the vLLM PR #1129 sibling for symmetry — the actual sglang_config is
TP-only.
Oseltamivir added a commit that referenced this pull request Apr 25, 2026
Run after moe-dense-tp-size: 1 added still hit:
  ValueError: Weight output_partition_size = 192 is not divisible
              by weight quantization block_n = 128.

Verified in upstream sglang dp_attention.py (compute_dp_attention_local_info):
  if not enable_dp_attention:
      return tp_rank, tp_size, 0   # moe_dense_tp_size IGNORED
The flag is only honored when enable_dp_attention=True. Since we
already dropped DP-attention to avoid the fork's mxfp4_deepseek bug,
moe-dense-tp-size: 1 was a no-op.

Two valid paths:
  (a) re-enable DP-attention without DeepEP — speculative, never tested
  (b) drop to TP=4 — 1536/4=384 divides cleanly by 128, FP8 quant
      passes. Matches NVIDIA/srt-slurm PR #75 (the only verified-
      working DSV4 sglang disagg recipe upstream) verbatim.

Going with (b). Recipes drop moe-dense-tp-size (no longer needed at
TP=4) and switch tensor-parallel-size to 4 in both prefill+decode.
gpus_per_prefill / gpus_per_decode drop to 4 (single GB200 node per
worker). prefill_nodes / decode_nodes track worker counts.

Topology shape (filenames keep historical dep8/dep16 naming for
symmetry with the vLLM #1129 sibling; actual config is TP=4):
  - 1k1k 1p1d-tep8:    P TP=4 / D TP=4 (2 nodes total)
  - 1k1k 1p1d-dep16:   P TP=4 / D TP=4 (2 nodes total) — same shape, different conc
  - 1k1k 3p1d-dep16:   P 3*TP=4 / D TP=4 (4 nodes)
  - 8k1k 1p1d-tep8:    P TP=4 / D TP=4 (2 nodes)
  - 8k1k 3p1d-dep16:   P 3*TP=4 / D TP=4 (4 nodes)
  - 8k1k 7p1d-dep16:   P 7*TP=4 / D TP=4 (8 nodes)

nvidia-master.yaml updated to match (tp: 4, ep: 1, dp-attn: false on
every prefill+decode block — including the commented 8k/1k block).

Also bumped SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK 1024 → 2048
in all env blocks (DeepEP path is dormant in this config but the env
var is in place for re-enabling later).
Oseltamivir added a commit that referenced this pull request Apr 30, 2026
* Day 0 DeepSeek V4 Pro FP4 GB200 disaggregated SGLang benchmarks

* Drop unsupported backend.connector field from sglang recipes

srtctl SrtConfig schema rejects backend.connector for the sglang
backend type. The field was carried over from the dynamo-vllm dsv4
recipes (where it is valid and set to null). PR #69/#75 sglang
recipes upstream do not declare it.

* Drop dynamo: version: 0.8.1 — incompatible with deepseek-v4-grace-blackwell sglang fork

Re-installing dynamo 0.8.1 over the lmsysorg/sglang:deepseek-v4-grace-blackwell
container's pre-baked sglang fails at import time:

    File ".../dynamo/sglang/health_check.py", line 20
      def _get_bos_token_id_from_engine(engine: Optional[sgl.Engine])
    AttributeError: module 'sglang' has no attribute 'Engine'

The DSV4 sglang fork bundled in this image does not expose sgl.Engine.
Drop the dynamo: block so srtctl uses the dynamo build pre-installed in
the container — matches NVIDIA/srt-slurm PR #75 (the only upstream
DSV4 sglang disagg recipe), which also has no dynamo: block.

* Add dynamo: install: false — srtctl default is install=True

srtctl's DynamoConfig (src/srtctl/core/schema.py L680) defaults to
install=True, which pip installs dynamo 0.8.0 even when no `dynamo:`
block is specified. Use the explicit opt-out so srtctl uses the dynamo
build baked into the lmsysorg/sglang:deepseek-v4-grace-blackwell
image. This image's sglang fork doesn't expose sgl.Engine, which
dynamo.sglang.health_check imports at top level — re-installing
dynamo over it breaks startup.

* Pin dynamo to v1.2.0-sglang-deepseek-v4-dev.1 tag (hash 21f135f5)

install: false fixed the pip-install crash, but the
lmsysorg/sglang:deepseek-v4-grace-blackwell image doesn't have dynamo
pre-installed (ModuleNotFoundError: No module named 'dynamo'), so
srtctl needs to install something compatible.

The DSV4-targeted dynamo tag v1.2.0-sglang-deepseek-v4-dev.1 (sha
21f135f5edf40e12e6ff5db2b462d862a6d6ab9b) includes
'from __future__ import annotations' in dynamo/sglang/health_check.py
(ai-dynamo PR #7255, commit cdb7218a, 2026-03-12), which makes the
Optional[sgl.Engine] annotation lazy. The PyPI 0.8.0/0.8.1 releases
predate that fix and crash with AttributeError on this image's
sglang fork.

* Force deepep-mode: low_latency to work around mxfp4+DeepEP normal-dispatch bug

Prefill warmup crashed in run 24941291328 with:

  File ".../sglang/srt/layers/quantization/mxfp4_deepseek.py", line 347
    topk_output = dispatch_output.topk_output
  AttributeError: 'DeepEPNormalDispatchOutput' object has no attribute 'topk_output'

Per sglang server_args.py, --deepep-mode defaults to 'auto', which
picks 'normal' for prefill batches and 'low_latency' for decode. The
mxfp4_deepseek MoE kernel only handles the low_latency dispatch
output shape (which carries topk_output); the normal-dispatch output
type does not, so any prefill forward (or decode warmup using
forward_idle) hits the AttributeError before the worker can serve.

Force deepep-mode: low_latency on every prefill + decode block that
uses moe-a2a-backend: deepep. The two 1p1d-dep8-tep8 decode blocks
remain TP-only (no DeepEP) and are unaffected.

Run reference: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24941291328

* Drop DeepEP / DP-attn / EP — fork-only mxfp4_deepseek bug, both dispatch types broken

Run after the deepep-mode: low_latency change failed again. Logs show
two distinct DeepEP-path failures:

1. Prefill scheduler crash:
     File '.../sglang/srt/layers/quantization/mxfp4_deepseek.py', line 347
       topk_output = dispatch_output.topk_output
     AttributeError: 'DeepEPLLDispatchOutput' object has no attribute 'topk_output'
   The earlier crash had 'DeepEPNormalDispatchOutput' — neither dispatch
   output type in this image's sglang fork exposes topk_output, so
   forcing low_latency vs normal mode does not help. mxfp4_deepseek.py
   is a fork-only file (does not exist in upstream sgl-project/sglang),
   so the API mismatch can only be fixed by rebuilding the image.

2. Decode CUDA graph capture crash:
     RuntimeError: Failed: Assertion error /sgl-workspace/DeepEP/csrc/deep_ep.cpp:1233
       'x.size(0) == topk_idx.size(0) and x.size(0) <= num_max_dispatch_tokens_per_rank'
   DeepEP low_latency_dispatch's per-rank token cap is exceeded by the
   cuda-graph-max-bs we configured.

Both failures are in the DeepEP path. Per upstream sgl-project/sglang
(server_args.py), moe_a2a_backend defaults to 'none', which uses
all-reduce/all-gather dispatch and lets TP shard the expert weights
across ranks (no separate EP needed). NVIDIA/srt-slurm PR #75 (the
only upstream DSV4 sglang disagg recipe) takes the same TP-only stance
— pure tensor-parallel-size: N with no enable-dp-attention, no
moe-a2a-backend deepep, no dp-size, no ep-size.

Drop those five fields from all 6 recipes. Topology shape preserved:
- 1k1k 1p1d: P TP=8 / D TP=8 (4 nodes)
- 1k1k 1p1d-wide: P TP=8 / D TP=16 (6 nodes)
- 1k1k 3p1d-wide: P 3*TP=8 / D TP=16 (10 nodes)
- 8k1k 1p1d: P TP=8 / D TP=8 (4 nodes)
- 8k1k 3p1d-wide: P 3*TP=8 / D TP=16 (10 nodes)
- 8k1k 7p1d-wide: P 7*TP=8 / D TP=16 (18 nodes)

DSV4-Pro at MXFP4 (~340 GB) shards comfortably under TP=8 (~42 GB/rank)
or TP=16 (~21 GB/rank) with mem-fraction-static: 0.82 leaving plenty of
KV cache headroom on each 96 GB GB200 GPU.

Topology filenames retain the 'dep8' / 'dep16' historical names from
the vLLM PR #1129 sibling for symmetry — the actual sglang_config is
TP-only.

* Add moe-dense-tp-size: 1 — fix shared-experts FP8 block-quant divisibility at TP=8/16

After the DeepEP removal, model load crashed at:

  File '.../sglang/srt/layers/quantization/fp8.py', line 282, in validate_block_quant_shapes
    raise ValueError(
  ValueError: Weight output_partition_size = 192 is not divisible
              by weight quantization block_n = 128.

DSV4-Pro's shared-experts gate_up_proj (intermediate ~1536) FP8-quants
in 128-element blocks. With TP=8 the per-rank slice is 1536/8=192,
which fails the divisibility check. PR #75 sidesteps this by using
TP=4 (1536/4=384), but that locks us into single-node workers.

sglang's --moe-dense-tp-size flag is the documented workaround
(server_args.py: 'useful when, with large TP size, there are errors
caused by weights in MLP layers having dimension smaller than the
min dimension GEMM supports'). Setting moe-dense-tp-size: 1 runs the
shared / dense-MLP layers replicated across ranks (TP=1) while the
rest of the model — attention, routed experts — keeps TP=8/16. Memory
cost is small since shared experts are a fraction of total weights.

Applied to all 6 recipes; topology/node counts unchanged.

* Set SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024 in all env blocks

Belt-and-suspenders for the DeepEP per-rank dispatch buffer cap. The
default is too low; with this set we'll have headroom if EP / DeepEP
is re-enabled later (e.g., once the fork's mxfp4_deepseek dispatch API
mismatch is fixed). 1024 matches the cookbook's B200 decode reference.

* Switch to TP=4 single-node — match PR #75 verbatim, fix FP8 block-quant

Run after moe-dense-tp-size: 1 added still hit:
  ValueError: Weight output_partition_size = 192 is not divisible
              by weight quantization block_n = 128.

Verified in upstream sglang dp_attention.py (compute_dp_attention_local_info):
  if not enable_dp_attention:
      return tp_rank, tp_size, 0   # moe_dense_tp_size IGNORED
The flag is only honored when enable_dp_attention=True. Since we
already dropped DP-attention to avoid the fork's mxfp4_deepseek bug,
moe-dense-tp-size: 1 was a no-op.

Two valid paths:
  (a) re-enable DP-attention without DeepEP — speculative, never tested
  (b) drop to TP=4 — 1536/4=384 divides cleanly by 128, FP8 quant
      passes. Matches NVIDIA/srt-slurm PR #75 (the only verified-
      working DSV4 sglang disagg recipe upstream) verbatim.

Going with (b). Recipes drop moe-dense-tp-size (no longer needed at
TP=4) and switch tensor-parallel-size to 4 in both prefill+decode.
gpus_per_prefill / gpus_per_decode drop to 4 (single GB200 node per
worker). prefill_nodes / decode_nodes track worker counts.

Topology shape (filenames keep historical dep8/dep16 naming for
symmetry with the vLLM #1129 sibling; actual config is TP=4):
  - 1k1k 1p1d-tep8:    P TP=4 / D TP=4 (2 nodes total)
  - 1k1k 1p1d-dep16:   P TP=4 / D TP=4 (2 nodes total) — same shape, different conc
  - 1k1k 3p1d-dep16:   P 3*TP=4 / D TP=4 (4 nodes)
  - 8k1k 1p1d-tep8:    P TP=4 / D TP=4 (2 nodes)
  - 8k1k 3p1d-dep16:   P 3*TP=4 / D TP=4 (4 nodes)
  - 8k1k 7p1d-dep16:   P 7*TP=4 / D TP=4 (8 nodes)

nvidia-master.yaml updated to match (tp: 4, ep: 1, dp-attn: false on
every prefill+decode block — including the commented 8k/1k block).

Also bumped SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK 1024 → 2048
in all env blocks (DeepEP path is dormant in this config but the env
var is in place for re-enabling later).

* Restore mi355x retry changelog entries clobbered by merge

The merge of main into this branch (c0aec93) accidentally overwrote
the two dsv4-fp8-mi355x-sglang retry entries (PR #1148 retry-pair tail
and PR #1159 retry-pair) with duplicated copies of our own
dsv4-fp4-gb200-dynamo-sglang entry. The process_changelog.py gate
rejects deletions, so the workflow blocked.

Restore the two mi355x entries verbatim from origin/main and keep a
single copy of our dsv4 entry, appended after the restored mi355x
block. perf-changelog.yaml diff vs origin/main is now additions-only.

* Switch back to TP=8: enable-dp-attention + moe-dense-tp-size: 1, no moe-a2a-backend

TP=4 OOMed — DSV4-Pro at MXFP4 doesn't fit on a single GB200 node.
Need TP=8 across 2 nodes (768 GB total).

But TP=8 trips two issues that earlier rounds papered over:
  a) shared-experts gate_up_proj FP8 block-quant divisibility
     (1536/8=192, not a multiple of block_n=128)
  b) the lmsysorg/sglang:deepseek-v4-grace-blackwell fork's
     mxfp4_deepseek kernel crashes on every DeepEP forward path

Single combo that solves both — verified in upstream sglang source:
  * enable-dp-attention: true  +  moe-dense-tp-size: 1
    Runs dense / shared-MLP layers replicated (TP=1) — fixes (a).
    moe-dense-tp-size IS gated on enable_dp_attention=True per
    python/sglang/srt/layers/dp_attention.py
    (compute_dp_attention_local_info ignores it when DP-attn is off).
  * NO moe-a2a-backend set (default 'none')
    Lands the model on forward_normal instead of forward_deepep —
    avoids (b). Verified in deepseek_v2.py:
      _enable_a2a_moe = is_deepep | is_mooncake | is_nixl | is_mori
                       | is_ascend_fuseep | is_flashinfer
    With backend='none' this is False and forward_normal runs.

Recipes: tensor-parallel-size 4 → 8 (both prefill+decode); add
moe-dense-tp-size: 1, enable-dp-attention: true, dp-size: 8 to every
sglang_config block; gpus_per_prefill / gpus_per_decode 4 → 8;
prefill_nodes / decode_nodes scale to workers × 2.

nvidia-master.yaml mirrors: tp 4 → 8, dp-attn false → true on every
prefill+decode block (active 1k/1k + commented 8k/1k). Topology shape
restored to:
  - 1k1k 1p1d-* : 4 nodes (was 2)
  - 1k1k 3p1d-* : 8 nodes (was 4)
  - 8k1k 1p1d-* : 4 nodes (commented)
  - 8k1k 3p1d-* : 8 nodes (commented)
  - 8k1k 7p1d-* : 16 nodes (commented)

* Scope sweep to high-conc DeepEP only — temporarily comment 1p1d blocks

Comment out the low-conc (1-64) and mid-conc (128-4096) search-space
entries in nvidia-master.yaml so the sweep iterates only on the high-
conc 3p1d-dep8-dep16 topology. Re-enable DeepEP on that one recipe to
exercise the EP path:

  3p1d-dep8-dep16 prefill+decode:
    + ep-size: 8
    + moe-a2a-backend: "deepep"
    + deepep-mode: low_latency
    (kept enable-dp-attention + moe-dense-tp-size: 1 + tp=8 / dp=8)

Master matrix label updated to ep=8 to reflect the recipe.

Sibling 1p1d recipes on disk are unchanged (still TP=8 + DP-attn,
no DeepEP). They are still referenced by the commented-out master
entries — restore them by uncommenting.

* tep fix + dep for high conc

* sike no dpa

* Cap SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK at 1024 — sglang LL hard ceiling

DeepEP run (3p1d-dep8-dep16) crashed at:

  File '.../sglang/srt/layers/moe/token_dispatcher/deepep.py', line 325
    assert self.num_max_dispatch_tokens_per_rank <= 1024
  AssertionError

_DeepEPDispatcherImplLowLatency enforces a hard upper bound of 1024 in
low_latency mode. We had bumped the env var to 2048 to give headroom
above the earlier C++ side cap (deep_ep.cpp:1233 'x.size(0) <=
num_max_dispatch_tokens_per_rank'), but 2048 trips this Python-side
assertion at scheduler init. 1024 is the exactly-allowed value: high
enough to cover the cuda-graph-max-bs we use, low enough to satisfy
the LL dispatcher constructor.

Apply 2048 → 1024 across all 6 recipes (every prefill + decode env
block).

* Revert 3p1d-dep8-dep16 to no-DeepEP TP-only; uncomment full 1k/1k + 8k/1k sweep

DeepEP is broken on the lmsysorg/sglang:deepseek-v4-grace-blackwell
image — verified across three runs (deepep-mode auto/normal,
deepep-mode low_latency, and the latest 3p1d try). All hit the
fork-only mxfp4_deepseek.py:347 reading dispatch_output.topk_output,
which neither DeepEPLLDispatchOutput nor DeepEPNormalDispatchOutput
exposes in this fork. Cannot be fixed from the recipe — needs the
image rebuilt with mxfp4_deepseek patched, or an upstream sglang fix.

3p1d-dep8-dep16 recipe: drop ep-size, moe-a2a-backend, deepep-mode
from prefill+decode. Now matches the 1p1d siblings: TP=8 + DP=8 +
moe-dense-tp-size: 1, default 'none' a2a backend (forward_normal
path bypasses the buggy mxfp4_deepseek kernel).

nvidia-master.yaml:
  * Uncomment the 1k/1k mid-conc and 8k/1k blocks (low + mid + high).
  * 3p1d-dep8-dep16 matrix label ep: 8 → ep: 1 to match recipe.

Sweep now expands to 6 entries / 27 conc points (3 1k/1k + 3 8k/1k).

* Try moe-a2a-backend: flashinfer on 3p1d-dep8-dep16 for high-conc EP

DeepEP is dead in this image (mxfp4_deepseek.py:347 reads
dispatch_output.topk_output, neither DeepEPNormal nor DeepEPLL output
exposes that field). Smoke test the only other plausible EP backend
upstream sglang offers: flashinfer.

Per upstream docs/advanced_features/expert_parallelism.md, flashinfer
is the documented option for 'Large-scale EP deployments' and uses a
different dispatcher than DeepEP — its output class may or may not
trip the same mxfp4_deepseek bug. Per server_args.py _handle_a2a_moe,
flashinfer auto-sets SGLANG_MOE_NVFP4_DISPATCH=True and forces
ep_size = tp_size, so we set ep-size: 8 explicitly. Everything else
(TP=8 / DP=8 / moe-dense-tp-size: 1) stays so the FP8 block-quant
path remains valid.

Scope: 1k/1k 3p1d-dep8-dep16 only. If the EP path serves on this
image, port back to the 1p1d siblings; if it crashes the same way
DeepEP did, revert to the no-EP forward_normal path and accept the
TP-only pareto.

nvidia-master.yaml matrix labels for the 3p1d entry updated to ep=8
to match the recipe.

* Revert flashinfer EP attempt — accept TP-only pareto, every EP backend dead on this image

flashinfer EP smoke test (3p1d-dep8-dep16 1k/1k) crashed at startup:

  File '.../sglang/srt/server_args.py', line 2133, in _handle_a2a_moe
    assert self.moe_runner_backend in [...]
  AssertionError: Flashinfer MoE A2A is only supported with
                  flashinfer_cutlass moe runner backend

flashinfer_cutlass is FP8-only — won't load DSV4-Pro's MXFP4 weights.
The only path that satisfies the assertion would also fail at model
load. So flashinfer is unusable for DSV4 on any image that doesn't
ship a flashinfer_mxfp4_cutlass runner (which doesn't exist).

Combined with the earlier deepep failure (mxfp4_deepseek.py:347
AttributeError on dispatch_output.topk_output, both Normal and LL
dispatch types), every EP backend sglang exposes in this image is
dead. Remaining options (mooncake, nixl-ep, mori, ascend_fuseep) are
either Ascend-NPU-only or not wired into this image.

Revert 3p1d-dep8-dep16 recipe to no-EP TP-only (matches the 5 sibling
recipes) and master.yaml matrix labels (ep: 8 → ep: 1).

PR description's Known Issues section updated to a 4-row table
covering every EP backend tried and accepted as dead end.

* fix(sglang): bump 8k1k prefill max-running-requests from 4 to 8

sglang computes per-rank capacity as max_running_requests // dp_size.
With dp-size=8, a value of 4 floors to 0, hitting the
"max_running_request is zero" assertion in tp_worker.py:277.
Bump to 8 so each DP rank gets at least 1 slot — matches the
working 1p1d recipe.

* ports

* Dsv4 fp4 gb200 dynamo sglang disagg (#1213)

* Modify deepseek-v4 configuration for new model settings

* Update YAML configuration for deepseek model

* adapt for model path, etc

* dev

* upd

* fix

* fix

* test

* add gb300

* upd

* fix

* fix

* fix

* fix(launch_gb300-cw): register deepseek-v4-pro alias in model_paths

After fixing the recipe overlay path in 1b07108, srtctl now loads our
hand-rolled SGLang recipe and runs preflight, which rejects:

    Error: Preflight failed for recipes/sglang/.../disagg-gb300-2p1d-dep4-dep8.yaml:
    - model.path: Model 'deepseek-v4-pro' is not a local model path and
      is not defined in srtslurm.yaml model_paths.

Both `disagg-gb300-2p1d-dep4-dep8.yaml` and `disagg-gb300-7p1d-dep4-dep8.yaml`
declare `model.path: deepseek-v4-pro` (per the recipe header comment, the
alias is intentionally aligned with `launch_gb200-nv.sh`'s srtslurm.yaml,
which exports `SRT_SLURM_MODEL_PREFIX=deepseek-v4-pro`). The gb300-cw
launcher only registered `dspro` and `dsv4-pro`, so the alias never
resolved. Add `deepseek-v4-pro` mapping to the same `${MODEL_PATH}`.

* fix(launch_gb300-cw): pull arm64 squash and force fresh import per runner

After fixing model.path alias (fe6815c), the slurm orchestrator reached
the head infrastructure srun and crashed at:

    [ERROR] Invalid image format: /mnt/vast/squash_dupe/lmsysorg_sglang_deepseek-v4-grace-blackwell.sqsh
    error: pyxis: failed to create container filesystem
    error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1

Two issues:

1. The runner pod that runs `enroot import docker://lmsysorg/sglang:...`
   is x86, so without `--arch` enroot fetches the amd64 manifest. The
   compute nodes (slurm-gb300-138-*) are aarch64 and pyxis there
   rejects the amd64 squash with "Invalid image format". Pass
   `--arch arm64` and tag the cache filename with `_arm64`.

2. `enroot import -o existing.sqsh ...` aborts with
   `[ERROR] File already exists` and leaves the stale file in place,
   so once a half-baked or pre-tag-update squash lands at this path it
   is silently reused on every subsequent CI run. Inspecting
   /mnt/vast/squash_dupe showed an Apr 26 amd64 sqsh shadowing the
   Apr 28 working arm64 sqsh exactly like this. `rm -f` before each
   import forces fresh downloads and picks up Docker tag updates.

3. Scope the squash filename per RUNNER_NAME (gb300-cw_0..3) so that
   the four matrix runners do not race on rm+import of the same shared
   path on /mnt/vast. Cost: ~64 GB on /mnt/vast (4 runners × 16 GB
   per arm64 sqsh) instead of 16 GB shared, which is fine on the
   shared VAST mount.

* fix(launch_gb300-cw): use enroot --arch aarch64, not arm64

enroot 4.0.1's `common::debarch()` accepts kernel-style arch names
(`x86_64`, `aarch64`, `ppc64le`) and emits Docker-style names
(`amd64`, `arm64`, `ppc64le`) on the wire. Passing `--arch arm64` (the
Docker manifest name) trips the function's else branch immediately:

    [ERROR] Unsupported architecture: arm64

Use the kernel name `aarch64` so enroot can map it to docker's `arm64`
manifest internally.
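
At this point the import step takes roughly the following shape (the variable name is illustrative, and the pre-staging change in the next commit replaces the in-CI import entirely):

SQSH="/mnt/vast/squash_dupe/lmsysorg_sglang_deepseek-v4-grace-blackwell_arm64.sqsh"
rm -f "$SQSH"    # drop any stale or half-baked squash first
enroot import --arch aarch64 -o "$SQSH" \
  docker://lmsysorg/sglang:deepseek-v4-grace-blackwell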

* fix(launch_gb300-cw): use pre-staged arm64 sqsh, drop in-CI enroot import

Even with `--arch aarch64`, `enroot import` from the CI runner pod (x86)
fails when converting the arm64 image:

    [INFO] Converting whiteouts...
    /usr/bin/bash: line 1: /usr/bin/enroot-aufs2ovlfs: Operation not permitted
    (repeated dozens of times, then preflight reports the sqsh as missing)

`enroot-aufs2ovlfs` requires CAP_SYS_ADMIN that the runner pod doesn't
hold, and `lmsysorg/sglang:deepseek-v4-grace-blackwell` is arm64-only,
so the conversion can't be skipped either. Per the documented manual
flow at https://gist.github.com/Fridge003/42c6001e0bb613acf0e411305b8ea780
the import has to be dispatched to an aarch64 GB300 compute node via
`srun`.

Rather than running an extra slurm job per CI invocation just to
prepare the sqsh, point the launcher at the pre-staged arm64 sqsh that
already lives at
`/mnt/vast/squash_dupe/lmsysorg_sglang_deepseek-v4-grace-blackwell_arm64.sqsh`
(refreshed manually via the gist script when the docker tag is bumped).
The matching `nginx_1.27.4_arm64.sqsh` was symlinked alongside.

Add a fast-fail check so a missing pre-staged sqsh produces a clear
error instead of a confusing pyxis "Invalid image format" three steps
later.
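
A sketch of what that guard can look like (path as named above; the exact error wording is illustrative):

SQSH="/mnt/vast/squash_dupe/lmsysorg_sglang_deepseek-v4-grace-blackwell_arm64.sqsh"
if [[ ! -f "$SQSH" ]]; then
  echo "ERROR: pre-staged arm64 sqsh missing at $SQSH; refresh it via the gist flow" >&2
  exit 1
fi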

* fix(launch_gb300-cw): persist dynamo wheel cache and ulimit preamble

Two follow-up fixes after CI started successfully reaching slurm but
the dynamo-from-source step (`dynamo: hash: 9d3c913d…`) is rebuilt cold
on every CI run, taking ~10-20 minutes per matrix job:

1. Cluster-wide dynamo wheel cache. srtctl's
   `_hash_cached_source_install` (`src/srtctl/core/schema.py:912`) is
   already designed to cache hash-pinned builds at
   `/configs/dynamo-wheels/<hash>/{ai_dynamo_runtime-*.whl,dynamo-src.tar.gz,.complete}`
   under flock. The cache only works if `/configs/dynamo-wheels` survives
   between CI runs, but the launcher does `rm -rf srt-slurm` and
   re-clones every time, blowing it away. Mount
   `/mnt/vast/dynamo-wheels-cache` (NFS, shared by every gb300-cw_N
   runner) over `/configs/dynamo-wheels` via srtslurm.yaml
   `default_mounts`, so the cache survives `rm -rf` and is shared
   across all matrix jobs. After the first cold build the warm path
   should drop dynamo install to ~30 s.

2. Cluster-wide bash preamble for ulimits. yangminl's manual setup on
   this cluster (`/mnt/home/yangminl/srt-slurm/srtslurm.yaml`) sets
   `default_bash_preamble: "ulimit -n 1048576 && ulimit -a"` so the
   dynamo frontend / sglang servers can accept the 8192-concurrency
   sweep without `EMFILE: too many open files`. Mirror that here. The
   feature is supported by srtctl's pinned commit
   (`src/srtctl/core/slurm.py:_get_cluster_bash_preamble`).

* fix(sglang/dsv4/8k1k recipes): set cpus-per-task=144 for dynamo build

slurm assigns 1 CPU/task by default; `scontrol show job <id>` from a
recent CI run shows `NumCPUs=4 NumTasks=4 CPUs/Task=1` with 4 nodes,
i.e. one core per worker. The dynamo `hash:` source install rebuilds
~500 rust crates (kube-client, tonic, hf-hub, image codecs ravif/exr,
pyo3 stack) and at one core takes 30+ min just for the cold build,
which dominates total CI time even with the new
`/configs/dynamo-wheels` cache (the cache only helps after the first
cold run).

Match yangminl's working manual setup
(`/mnt/home/yangminl/srt-slurm/recipes/dsv4-pro/sglang/gb300-fp4/all-dynamo.yaml`)
which sets `sbatch_directives.cpus-per-task: "144"` so cargo gets the
full GB300 host (144 cores) and finishes maturin in a few minutes.

* fix(sglang/dsv4/8k1k recipes): set cpus-per-task=144 and mem=0

slurm assigns 1 CPU/task by default; `scontrol show job 613` from a
running CI job confirmed `NumCPUs=4 NumTasks=4 CPUs/Task=1` with 4
nodes — one core per worker. The dynamo `hash:` cold source install
rebuilds ~500 rust crates (kube-client, tonic, hf-hub, image codecs
ravif/exr, the pyo3 stack) and at one core takes 30+ min just for the
cold build, which dominates total CI time even with the new
`/configs/dynamo-wheels` cache (the cache only helps after the first
cold run).

Match yangminl's working manual setup on the same gb300-cw cluster
(`/mnt/home/yangminl/srt-slurm/recipes/dsv4-pro/sglang/gb300-fp4/all-dynamo.yaml`)
which sets:
  sbatch_directives:
    cpus-per-task: "144"
    mem: "0"

cargo then gets the full 144-core GB300 host and finishes maturin in a
few minutes; mem=0 hands the worker the entire node's RAM so the
dynamo build + DSV4-Pro 671B FP4 weight load fit without OOM.

* fix(launch_gb300-cw): pin srt-slurm fork with parallel sa-bench

The current sa-bench in NVIDIA/srt-slurm@9d75f82 generates random
prompts single-threaded, which dominates 7p1d/conc=8192 bench startup
(~50 min just for the 81920-prompt main pass before the first HTTP
request reaches dynamo). Pin to fzyzcjy/srt-slurm fork branch
`feat/random-num-workers` (commit 8094cfb), which is 9d75f82 + the
SemiAnalysisAI/InferenceX `utils/bench_serving/` benchmark_serving.py
ported into sa-bench. With `--random-num-workers 48` (now the default
in bench.sh) prompt generation drops to ~1 min on a 144-core GB300
host, putting the bench-startup cost on the same order as
infra+model-load instead of dominating it.

The fork is paired with the upstream PR
NVIDIA/srt-slurm#114; once that merges, this
pin should revert to the bumped NVIDIA/srt-slurm SHA.

* fix(launch_gb300-cw): bump srt-slurm fork pin to minimal multiproc patch

Previous pin (8094cfb) was a wholesale replacement of sa-bench with
the SemiAnalysisAI/InferenceX bench_serving — that dropped
`async_request_dynamo_completions` from `ASYNC_REQUEST_FUNCS`, so
`bench.sh` would have died on `--backend dynamo` argparse rejection
the moment the bench client started.

New pin (4249d16) is a tight ~100-line patch on top of
NVIDIA/srt-slurm@9d75f82 that only adds parallel random prompt
generation (`--random-num-workers`); everything else, including the
dynamo backend and `--custom-tokenizer` plumbing, stays exactly the
same as upstream. See NVIDIA/srt-slurm#114.

* ci: temporarily comment out conc-list:[64] 2p1d entry

Focus CI on the conc=8192 7p1d max-throughput entry only — re-enable
the 2p1d/conc=64 mid-curve entry shortly once that's green.

* ci(eval): temporarily skip dsv4-fp4-gb300 dynamo-sglang eval-only entry

The srt-slurm pin (9d75f82, recipes/dsv4-agg-disagg) lacks the lm-eval
orchestrator path that lives on sa-submission-q2-2026. Skip the auto-generated
eval-only matrix entry for this config until the pin is bumped.

TODO: remove this branch once the pin is moved to sa-submission-q2-2026 (which
already carries the EVAL_ONLY do_sweep.py branch and lm-eval/bench.sh).

* bench(7p1d-dep4-dep8): swap sa-bench default for yangminl's gb300-cw recipe

Replace the sa-bench builder (concurrencies=8192, req_rate=inf, sa-bench
default num_prompts/num_warmups multipliers) with the exact custom
command from yangminl's gb300-cw 8k1k_hightpt[0] run (slurm job 564 on
the dsv4-pro-gb300-fp4 cluster):

  concurrency=4096, rate=48, num_prompts=40960, num_warmups=512,
  random_num_workers=96.

Why mirror those exact knobs: that recipe is what produced the 7p1d
reference numbers we benchmarked against (358K total tok/s, 39.9K output
tok/s, ~5s mean TTFT). Running sa-bench at concurrency=8192/rate=inf
will saturate the 1-decode-worker GPU (we observed 16384 concurrency on
job 617 saturated decode at ~390 running/rank with mean TTFT ~257s,
i.e. equilibrium gated by decode compute, not the bench), making the
result not directly comparable.

Bench framework note: the fzyzcjy fork's benchmark_serving.py /
benchmark_utils.py / encoding_dsv4.py are byte-identical to upstream
SemiAnalysisAI/InferenceX/main; only backend_request_func.py adds five
per-request debug print sites (ok=/lat=/url=/plen=/err=). Throughput
numbers should match sa-bench at the same flags; the fork is chosen
here to keep parity with the reference run's logs.

Skipped on purpose:
- DeepGEMM env knobs (SGLANG_DG_CACHE_DIR / SGLANG_JIT_DEEPGEMM_PRECOMPILE
  vs SGLANG_JIT_DEEPGEMM_FAST_WARMUP=1) — yangminl's cache dir is
  /configs/deepgemm_cache on the gb300-cw host and isn't portable here;
  PR's FAST_WARMUP path stays.
- expert_location_dispatch.py topk_ids int32 cast (yangminl commits
  94b7dc4c7 + e933ef2b1 on the patched sglang fork) — not pulling that
  into the container build.

* config(7p1d-dep4-dep8): align with job 564 — multi-frontend, sbatch dirs, name

Eliminate every non-cluster-specific diff vs job 564's resolved config
(`/outputs/564/config_8k1k_hightpt_0.yaml`):

- name: match `dsv4-pro-gb300-fp4_8k1k_hightpt_0` (was stale gb200 string)
- frontend.enable_multiple_frontends: false → true; add num_additional_frontends: 8
  (job 564 ran 9 dynamo frontends behind nginx; PR was running a single
  frontend, which was a real router-side runtime diff)
- slurm.time_limit: 8h → 3h to match job 564
- sbatch_directives.cpus-per-task: 144, mem: 0 (portable, was missing)
- drop health_check block (job 564 doesn't set it; rely on srtctl default)

Remaining diffs vs job 564 are all either cluster-specific path bindings
(slurm.partition=hpc-mid, frontend.nginx_container, extra_mount of
yangminl's patched sglang) or DG-cache env (SGLANG_DG_CACHE_DIR /
SGLANG_JIT_DEEPGEMM_PRECOMPILE) — those need InferenceX-cluster-side
equivalents and are documented in the header comment.

* config(7p1d-dep4-dep8): keep PR name field, revert to original

* upd

* fix

* fix

* middle

* fi

* fix

* upd

* fix

* upd

---------

Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Cheng Wan <chwan@rice.edu>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>