[NV] dsv4-fp4-gb200-dynamo-vllm#1163
Conversation
…-dsv4-recipes (PR #77)

Repoint launch_gb200-nv.sh to NVIDIA/srt-slurm@aflowers/gb200-dsv4-recipes, which supersedes #71 and ships the vllm_numa_bind_hash_fix.py patch and sa-bench DSV4 tokenizer support, so numa-bind, benchmark.use_chat_template, and benchmark.tokenizer_mode no longer have to be stripped from recipes.

* 8k/1k search-space expanded from 3 topologies to 8: adds 1p4d/1p8d pure-TP decode (offload), 1p1d/2p1d/3p1d DEP-8 decode, and a 3p1d-dep16-40 wide decode shape.
* 1k/1k topologies unchanged (no upstream 1k/1k counterpart); 1k/1k tep8 also re-enables numa-bind + chat template to stay consistent.
* Local recipe deltas vs upstream are limited to: model.path alias rename deepseekv4-fp4 -> deepseek-v4-pro (matches SRT_SLURM_MODEL_PREFIX), container kept on the floating :deepseekv4-cu130 tag, slurm.time_limit added, and health_check.max_attempts bumped 360 -> 1440 for cold-cache loads.
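For illustration, a rough sketch of what those local deltas look like inside a recipe file; the key nesting and the time_limit value are assumptions inferred from the field names above, not copied from the actual recipes:

```yaml
# Hypothetical excerpt: key nesting and the time_limit value are assumed;
# only the field names and the 360 -> 1440 bump come from the commit above.
model:
  path: deepseek-v4-pro                           # alias renamed from deepseekv4-fp4 (matches SRT_SLURM_MODEL_PREFIX)
  container: "vllm/vllm-openai:deepseekv4-cu130"  # kept on the floating tag rather than a pinned digest

slurm:
  time_limit: "04:00:00"                          # placeholder value; the commit only says the key was added

health_check:
  max_attempts: 1440                              # bumped from 360 so cold-cache model loads don't time out
```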
…amo-vllm

# Conflicts:
#	perf-changelog.yaml
The 1k/1k tep8 numa-bind + chat-template re-enabling is rolled back — 1k/1k stays at the previous local-extrapolation tuning. Updates the perf-changelog entry to reflect that.
Additional findings (outside current diff — PR may have been updated during review):
- 🔴 `benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml:140-144` — Within the 1k/1k sweep, this PR updated only `1k1k/disagg-gb200-1p1d-dep8-tep8.yaml` (use_chat_template: true, tokenizer_mode: deepseek_v4, prefill numa-bind: true) but left the two sibling DEP-16 recipes (`1k1k/disagg-gb200-1p1d-dep8-dep16.yaml` and `1k1k/disagg-gb200-3p1d-dep8-dep16.yaml`) on the old workaround path (use_chat_template: false, no tokenizer_mode, no numa-bind). All three are routed from the same `dsv4-fp4-gb200-dynamo-vllm` 1k/1k search-space (`.github/configs/nvidia-master.yaml` lines 7577-7620), so conc 1-64 now run through sa-bench's chat-template + DSV4 tokenizer path while conc 128-8192 use the legacy random-token /v1/completions path. The changelog explicitly states the tep8 update was made "to stay consistent", but the result is the opposite. Either also flip the two DEP-16 recipes (use_chat_template: true, tokenizer_mode: deepseek_v4, prefill numa-bind: true) or revert the tep8 update.

Extended reasoning
What is happening
The PR description says "drops local workarounds: numa-bind, benchmark.use_chat_template, and benchmark.tokenizer_mode are restored now that PR #77 ships vllm_numa_bind_hash_fix.py and sa-bench DSV4 tokenizer support", and the perf-changelog entry adds "1k/1k tep8 also re-enables numa-bind + chat template to stay consistent". The intent is clearly to get all 1k/1k recipes onto the same input/tokenization pipeline as the 8k/1k recipes that PR #77 ships.
But only the 1k/1k tep8 recipe was actually touched. The two DEP-16 siblings still carry the pre-PR workaround:
| 1k/1k recipe | use_chat_template | tokenizer_mode | prefill numa-bind |
| --- | --- | --- | --- |
| `1k1k/disagg-gb200-1p1d-dep8-tep8.yaml` (modified) | true | deepseek_v4 | true |
| `1k1k/disagg-gb200-1p1d-dep8-dep16.yaml` (line 125, unchanged) | false | (unset) | (unset) |
| `1k1k/disagg-gb200-3p1d-dep8-dep16.yaml` (line 117, unchanged) | false | (unset) | (unset) |

Why it matters
`.github/configs/nvidia-master.yaml` lines 7577-7620 route all three of these inside the same `dsv4-fp4-gb200-dynamo-vllm` 1k/1k seq-len-config:

- conc [1, 4, 8, 16, 32, 64] -> 1p1d-dep8-tep8.yaml (chat_template=true, tokenizer_mode=deepseek_v4)
- conc [128, 256, 1024, 2048, 4096] -> 1p1d-dep8-dep16.yaml (chat_template=false, tokenizer_mode unset)
- conc [4096, 8192] -> 3p1d-dep8-dep16.yaml (chat_template=false, tokenizer_mode unset)

sa-bench's `use_chat_template: true` path calls `tokenizer.apply_chat_template()` and adds role markers / a system prompt, so the actual on-the-wire prompt token count and request endpoint differ from the random-token `/v1/completions` path used when `use_chat_template: false`. So within a single 1k/1k sweep:

- conc 1-64 (tep8) measures latency through chat-template + DSV4-aware tokenization
- conc 128-8192 (DEP-16) measures throughput through raw tokens, no DSV4 tokenizer, plain completions endpoint
This breaks the comparable-curve invariant the PR was trying to restore: a low-conc TEP latency point and a higher-conc DEP-16 throughput point on the same 1k/1k pareto front are no longer apples-to-apples, because their input/tokenization pipelines differ.
Step-by-step proof
- The 1k/1k tep8 recipe (touched by this PR) ends with:

      benchmark:
        ...
        tokenizer_mode: "deepseek_v4"
        use_chat_template: true

- `1k1k/disagg-gb200-1p1d-dep8-dep16.yaml` line 125 (NOT in this PR's diff) ends with:

      benchmark:
        ...
        use_chat_template: false

  No `tokenizer_mode`, no prefill `numa-bind: true`.
- `1k1k/disagg-gb200-3p1d-dep8-dep16.yaml` line 117 (NOT in this PR's diff): identical situation.
- `.github/configs/nvidia-master.yaml` lines 7577-7620 confirm all three are in the same 1k/1k search-space, so a single benchmark sweep at `isl=1024, osl=1024` will hit all three recipes at different concurrencies.
- Pre-PR, all three 1k/1k recipes had `use_chat_template: false` (consistent). Post-PR, only one was flipped, so this PR introduces the inconsistency.
How to fix
Either:
- (Most consistent with the changelog's stated intent) Update both DEP-16 recipes the same way: set `benchmark.use_chat_template: true`, `benchmark.tokenizer_mode: "deepseek_v4"`, and prefill `numa-bind: true` (plus the worker-side `tokenizer-mode: deepseek_v4` if not already there); a sketch follows after this list.
- Or revert the 1k/1k tep8 update for now and leave all three on the old workarounds, matching pre-PR behavior. The 1k/1k recipes have no upstream PR #77 reference anyway, so this is a self-consistent fallback until 1k/1k DEP-16 also gets a chance to be validated end-to-end.
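A minimal sketch of option 1 as it would land in each DEP-16 recipe; the placement of the prefill numa-bind key is an assumption based on how the review describes the tep8 recipe, not taken verbatim from the files:

```yaml
# Sketch of option 1 for both DEP-16 recipes (1p1d and 3p1d); key placement assumed.
benchmark:
  use_chat_template: true         # was false (legacy random-token /v1/completions path)
  tokenizer_mode: "deepseek_v4"   # was unset; routes through sa-bench's DSV4 tokenizer

prefill:
  numa-bind: true                 # was absent; relies on vllm_numa_bind_hash_fix.py shipped by PR #77
```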
Runs will succeed either way (the recipes are individually valid), so this is a benchmark-correctness / comparability bug, not a runtime crash. Severity: normal.
    VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1"
    VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random"
We need to comment these out in all yamls - they were added to see the most optimal engine perf.
…ION_STRATEGY These were upstream's tools for measuring most-optimal engine perf via randomized routing — disable them so the benchmark exercises the real expert routing path. Applied to every recipe that had them (all 8 new 8k/1k recipes plus the 1k/1k tep8 recipe).
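Concretely, the resulting change in each affected recipe looks roughly like the following; the name of the enclosing env-var map is an assumption:

```yaml
# Sketch only: the enclosing "env" key is assumed. The two variables are commented
# out rather than deleted so they can be re-enabled for engine-perf studies.
env:
  # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1"
  # VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random"
```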
Re-mirror from NVIDIA/srt-slurm aflowers/gb200-dsv4-recipes branch under recipes/vllm/deepseek-v4-pro-sa/ — the SemiAnalysis-curated subset of PR #77. 1k/1k recipes are removed (only 8k/1k is in scope now).

Topology changes vs the previous mirror:
* drop 1p1d-tep8, 2p1d-c256-c512-c1024, 3p1d-c2048, 3p1d-dep16-40, 7p1d
* keep 1p1d-dep8-dep8-16 (concurrencies bumped to 64x128x256x512x1024), 1p4d-tp8, 1p8d-tp8
* add new c4096-offload variants: 2p1d-dep8-dep8, 3p1d-dep8-dep8, 3p1d-dep8-dep16

Other consistency fixes:
* dynamo.install: false uniformly (matches -sa/ — assumes pre-installed dynamo in the container)
* dynamo.hash 6a159fed... uniformly
* model.container set to vllm/vllm-openai:deepseekv4-cu130-dynamo across all 6 recipes so the recipe lookup matches the alias key the launch script registers in srtslurm.yaml from nvidia-master.yaml's image: field
* slurm.time_limit + health_check inserted right after setup_script: in a consistent position
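Roughly, the consistency block the six re-mirrored recipes converge on; the nesting is an assumption from the bullet names above and the values shown are abbreviated or placeholders:

```yaml
# Hypothetical excerpt: nesting assumed, hash abbreviated as in the commit message.
dynamo:
  install: false                  # dynamo assumed pre-installed in the container
  hash: "6a159fed..."             # pinned uniformly across all 6 recipes

model:
  container: "vllm/vllm-openai:deepseekv4-cu130-dynamo"  # matches the alias key registered in srtslurm.yaml

setup_script: "..."               # unchanged
slurm:
  time_limit: "04:00:00"          # placeholder value; inserted right after setup_script
health_check:
  max_attempts: 1440              # placeholder value
```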
…vllm-gb200-v0.20.0

Bump container image to vllm/vllm-openai:v0.20.0-ubuntu2404@sha256:46da022c... in nvidia-master.yaml and across all 6 recipes (keeps the recipe model.container in lockstep with the alias key the launch script registers in srtslurm.yaml). Repoint launch_gb200-nv.sh from aflowers/gb200-dsv4-recipes to aflowers/vllm-gb200-v0.20.0 — the 0.20.0 branch.

Per-recipe changes:
* Replace dynamo.hash + dynamo.install: false with dynamo.install: true + wheel: "1.2.0.dev20260426". The new container is vanilla vLLM 0.20.0 without dynamo pre-installed, so srtctl installs from the pinned wheel.
* Add benchmark.custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer"
* Add an identity: block at the bottom of every recipe — model repo+revision, container image (sha256), and dynamo+vllm framework versions for reproducibility tracking.
* 1p8d recipe: add conc 1 (concurrencies "1x8x16x32x64x128x256x512") and rename to disagg-gb200-1p8d-dep8-tp8-c1-c8-c16-c32-c64-c128-c256-offload.yaml. CONFIG_FILE reference in nvidia-master.yaml updated; conc-list extended to [1, 8, 16, 32, 64, 128, 256, 512].
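A condensed sketch of the per-recipe shape after this commit; the identity sub-key names (repo, image, frameworks entries) are assumptions based on the description above, and the revision value is a placeholder:

```yaml
# Sketch only: identity sub-key names are assumed; version strings are copied from the commit message.
dynamo:
  install: true
  wheel: "1.2.0.dev20260426"      # srtctl installs dynamo from this pinned wheel

identity:                         # reproducibility-tracking block at the bottom of every recipe
  model:
    repo: deepseek-ai/DeepSeek-V4-Pro        # assumed field name; repo from the PR description
    revision: "<hf-revision-sha>"            # placeholder
  container:
    image: "vllm/vllm-openai:v0.20.0-ubuntu2404@sha256:46da022c..."
  frameworks:
    vllm: "0.20.0"
    dynamo: "1.2.0.dev20260426"
```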
custom_tokenizer (added in the previous commit) covers sa-bench's
DSV4 tokenization; the redundant tokenizer_mode field is no longer
needed. The vllm_config.{prefill,decode}.tokenizer-mode worker-side
setting is unchanged.
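In recipe terms, the change looks roughly like this; only the field names mentioned in the commit message are used, and the vllm_config nesting is an assumption:

```yaml
benchmark:
  custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer"
  # tokenizer_mode: "deepseek_v4"   # removed: redundant now that custom_tokenizer covers DSV4 tokenization

vllm_config:                        # worker-side settings, left unchanged by this commit
  prefill:
    tokenizer-mode: deepseek_v4
  decode:
    tokenizer-mode: deepseek_v4
```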
Use just the tag (vllm/vllm-openai:v0.20.0-ubuntu2404) in nvidia-master.yaml image:, every recipe's model.container, every recipe's identity.container.image, and the recipe header comment lines.
The /mnt/numa1/models/deepseek-v4-pro/ stage doesn't carry HF revision metadata (no .huggingface/refs/main, no .cache/huggingface/download/ metadata), so identity.model.revision verification would fail every job with "no HF revision found at /model". Drop the block until the stage is re-populated via huggingface_hub.snapshot_download or the ref marker is planted manually. identity.container and identity.frameworks are preserved.
The compute-node-local NVMe path is not visible to the GHA runner host, so srtctl preflight (which runs there) failed with "model path unavailable". Use the Lustre copy instead so preflight resolves the alias to a path the runner can stat.
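For illustration only; the Lustre mount point below is a placeholder, not the real path:

```yaml
model:
  # path: /mnt/numa1/models/deepseek-v4-pro/    # compute-node-local NVMe, not visible to the GHA runner host
  path: /lustre/share/models/deepseek-v4-pro/   # placeholder Lustre path that srtctl preflight can stat from the runner
```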
@Oseltamivir we were waiting for the queued job to run when you made your changes. Because of the changes, the current job was cancelled and shuffled to the back of the queue, and now we are waiting again. That has slowed down this submission cycle by another 12 hours. Can you please hold off on making changes until we are ready to merge?

@Oseltamivir

good morning @alec-flowers

@alec-flowers Which queued job were you waiting for to run?

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25030685109/job/73319580207?pr=1163 The one that was cancelled when a new commit was pushed.

@alec-flowers apologies, didn't see it as it wasn't on HEAD. I've cancelled my SGL GB200 run so this can proceed, godspeed

@claude please update PR description

Claude finished @kedarpotdar-nv's task in 1m 15s. Updated PR Description

PR description has been updated with a full summary covering:
    git clone https://github.com/NVIDIA/srt-slurm.git "$SRT_REPO_DIR"
    cd "$SRT_REPO_DIR"
    -git checkout sa-submission-q2-2026
    +git checkout aflowers/vllm-gb200-v0.20.0
Question on this. I made this branch off of main since it has some extra stuff I wanted to get in there.
Is sa-submission-q2-2026 where you want these upstreamed? main, I guess, is a moving target. Will sa-submission-q2-2026 ever make it to main, or will it live as a legacy fork?
Historically it has been Nvidia's choice, and going by history it will live as a legacy fork, since there have to be some changes to accommodate InferenceX runners.
Some changes in sa-submission-q2-2026 will never make it to main, e.g. several parts of lm-eval. That said, I will investigate changing InfX to be usable with srt-slurm's main: NVIDIA/srt-slurm#41

Summary
Rebase DSv4-FP4 GB200 dynamo-vLLM disaggregated recipes onto the NVIDIA/srt-slurm `aflowers/vllm-gb200-v0.20.0` branch, upgrading from the `deepseekv4-cu130` image to vLLM v0.20.0 (`vllm/vllm-openai:v0.20.0-ubuntu2404`). Drops the previous 1k/1k sequence-length configs and local workarounds (numa-bind, chat template, tokenizer mode) that earlier submissions required.

Supersedes the previous srt-slurm branch (`sa-submission-q2-2026`) and mirrors three validated 8k/1k benchmark points from upstream.

Changes
Image & branch update
- `vllm/vllm-openai:deepseekv4-cu130` → `vllm/vllm-openai:v0.20.0-ubuntu2404`
- `sa-submission-q2-2026` → `aflowers/vllm-gb200-v0.20.0`
- `launch_gb200-nv.sh` — derives the squash filename from `$IMAGE` instead of hardcoding it
- `enroot import` for the vLLM container (was previously commented out)

Recipe changes (8k/1k)
Three new recipes mirrored from upstream, replacing the previous 6-topology sweep:
- `disagg-gb200-low-latency.yaml`
- `disagg-gb200-mid-curve.yaml`
- `disagg-gb200-max-tpt.yaml`

Two existing offload recipes updated in place:
- `disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml` — rebased on the upstream v0.20.0 recipe
- `disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml` — rebased on the upstream v0.20.0 recipe

Deleted recipes
- `1k1k/disagg-gb200-1p1d-dep8-dep16.yaml`, `1k1k/disagg-gb200-3p1d-dep8-dep16.yaml` (no upstream NVIDIA reference for 1k/1k DSv4 disagg)
- `8k1k/disagg-gb200-7p1d-dep8-dep16.yaml` — 18-node full-cluster topology dropped in favor of the validated 3P/1D max-tpt recipe

nvidia-master.yaml
- `dsv4-fp4-gb200-dynamo-vllm` search-space: from 6 concurrency bands (across 1k/1k + 8k/1k) down to 3 targeted 8k/1k points
- `deepseek-ai/DeepSeek-V4-Pro`

Other
- `perf-changelog.yaml` entry added
- `/mnt/numa1/models/deepseek-v4-pro/`

Upstream references
- `aflowers/vllm-gb200-v0.20.0`