[NVIDIA] bug fix: add TP=2,4 to B200, just as mi355 has by cquil11 · Pull Request #77 · SemiAnalysisAI/InferenceX

cquil11 · 2025-10-01T17:58:00Z

To see full Pareto frontier and match behavior of AMD GPUs.
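For context on why the extra TP values matter: each TP setting contributes points on a throughput-versus-interactivity plot, and the frontier is the set of points that no other point beats on both axes. Below is a minimal sketch of that filtering step. The `Point` type, its field names, and the sample numbers are hypothetical and are not taken from this repository or its benchmark results.

```python
from typing import NamedTuple


class Point(NamedTuple):
    label: str            # hypothetical config label, e.g. "tp2"
    interactivity: float  # e.g. tokens/s per user; higher is better
    throughput: float     # e.g. tokens/s per GPU; higher is better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the points not dominated on both axes, sorted by interactivity."""
    frontier: list[Point] = []
    best_throughput = float("-inf")
    # Walk from the highest interactivity downwards. A point survives only if
    # its throughput beats every point with equal or higher interactivity.
    for p in sorted(points, key=lambda p: (-p.interactivity, -p.throughput)):
        if p.throughput > best_throughput:
            frontier.append(p)
            best_throughput = p.throughput
    return sorted(frontier, key=lambda p: p.interactivity)


if __name__ == "__main__":
    # Made-up numbers purely for illustration; not measured results.
    sample = [
        Point("tp1", 20.0, 1000.0),
        Point("tp2", 35.0, 800.0),
        Point("tp4", 60.0, 500.0),
        Point("tp8", 90.0, 200.0),
    ]
    for p in pareto_frontier(sample):
        print(p.label, p.interactivity, p.throughput)
```

If only some TP values are swept, intermediate points are simply absent from the input, so the frontier has gaps in that region.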


* Re-submit dsv4-fp4-gb200-dynamo-vllm against srt-slurm aflowers/gb200-dsv4-recipes (PR #77)

  Repoint launch_gb200-nv.sh to NVIDIA/srt-slurm@aflowers/gb200-dsv4-recipes, which supersedes #71 and ships the vllm_numa_bind_hash_fix.py patch and sa-bench DSV4 tokenizer support, so numa-bind, benchmark.use_chat_template, and benchmark.tokenizer_mode no longer have to be stripped from recipes.

  8k/1k search-space expanded from 3 topologies to 8: adds 1p4d/1p8d pure-TP decode (offload), 1p1d/2p1d/3p1d DEP-8 decode, and a 3p1d-dep16-40 wide decode shape. 1k/1k topologies unchanged (no upstream 1k/1k counterpart); 1k/1k tep8 also re-enables numa-bind + chat template to stay consistent.

  Local recipe deltas vs upstream are limited to: model.path alias rename deepseekv4-fp4 -> deepseek-v4-pro (matches SRT_SLURM_MODEL_PREFIX), container kept on the floating :deepseekv4-cu130 tag, slurm.time_limit added, and health_check.max_attempts bumped 360 -> 1440 for cold-cache loads.

* Revert 1k/1k tep8 recipe changes; leave 1k/1k untouched

  The 1k/1k tep8 numa-bind + chat-template re-enabling is rolled back; 1k/1k stays at the previous local-extrapolation tuning. Updates the perf-changelog entry to reflect that.

* Comment out VLLM_RANDOMIZE_DP_DUMMY_INPUTS / VLLM_MOE_ROUTING_SIMULATION_STRATEGY

  These were upstream's tools for measuring most-optimal engine perf via randomized routing; disable them so the benchmark exercises the real expert routing path. Applied to every recipe that had them (all 8 new 8k/1k recipes plus the 1k/1k tep8 recipe).

* Switch to deepseek-v4-pro-sa SA-curated subset; drop 1k/1k

  Re-mirror from NVIDIA/srt-slurm aflowers/gb200-dsv4-recipes branch under recipes/vllm/deepseek-v4-pro-sa/, the SemiAnalysis-curated subset of PR #77. 1k/1k recipes are removed (only 8k/1k is in scope now).

  Topology changes vs the previous mirror:
  * drop 1p1d-tep8, 2p1d-c256-c512-c1024, 3p1d-c2048, 3p1d-dep16-40, 7p1d
  * keep 1p1d-dep8-dep8-16 (concurrencies bumped to 64x128x256x512x1024), 1p4d-tp8, 1p8d-tp8
  * add new c4096-offload variants: 2p1d-dep8-dep8, 3p1d-dep8-dep8, 3p1d-dep8-dep16

  Other consistency fixes:
  * dynamo.install: false uniformly (matches -sa/; assumes pre-installed dynamo in the container)
  * dynamo.hash 6a159fed... uniformly
  * model.container set to vllm/vllm-openai:deepseekv4-cu130-dynamo across all 6 recipes so the recipe lookup matches the alias key the launch script registers in srtslurm.yaml from nvidia-master.yaml's image: field
  * slurm.time_limit + health_check inserted right after setup_script: in a consistent position

* Update perf-changelog.yaml

* Switch to vLLM 0.20.0 + dynamo wheel pin; rebase recipes on aflowers/vllm-gb200-v0.20.0

  Bump container image to vllm/vllm-openai:v0.20.0-ubuntu2404@sha256:46da022c... in nvidia-master.yaml and across all 6 recipes (keeps the recipe model.container in lockstep with the alias key the launch script registers in srtslurm.yaml). Repoint launch_gb200-nv.sh from aflowers/gb200-dsv4-recipes to aflowers/vllm-gb200-v0.20.0, the 0.20.0 branch.

  Per-recipe changes:
  * Replace dynamo.hash + dynamo.install: false with dynamo.install: true + wheel: "1.2.0.dev20260426". The new container is vanilla vLLM 0.20.0 without dynamo pre-installed, so srtctl installs from the pinned wheel.
  * Add benchmark.custom_tokenizer: "sa_bench_tokenizers.vllm_deepseek_v4.VLLMDeepseekV4Tokenizer"
  * Add identity: block at the bottom of every recipe: model repo+revision, container image (sha256), and dynamo+vllm framework versions for reproducibility tracking.
  * 1p8d recipe: add conc 1 (concurrencies "1x8x16x32x64x128x256x512") and rename to disagg-gb200-1p8d-dep8-tp8-c1-c8-c16-c32-c64-c128-c256-offload.yaml. CONFIG_FILE reference in nvidia-master.yaml updated; conc-list extended to [1, 8, 16, 32, 64, 128, 256, 512].

* Drop benchmark.tokenizer_mode from all 6 recipes

  custom_tokenizer (added in the previous commit) covers sa-bench's DSV4 tokenization; the redundant tokenizer_mode field is no longer needed. The vllm_config.{prefill,decode}.tokenizer-mode worker-side setting is unchanged.

* Strip sha256 pin from vllm container references

  Use just the tag (vllm/vllm-openai:v0.20.0-ubuntu2404) in nvidia-master.yaml image:, every recipe's model.container, every recipe's identity.container.image, and the recipe header comment lines.

* Drop identity.model from all 6 recipes

  The /mnt/numa1/models/deepseek-v4-pro/ stage doesn't carry HF revision metadata (no .huggingface/refs/main, no .cache/huggingface/download/ metadata), so identity.model.revision verification would fail every job with "no HF revision found at /model". Drop the block until the stage is re-populated via huggingface_hub.snapshot_download or the ref marker is planted manually. identity.container and identity.frameworks are preserved.

* Switch dsv4-fp4 MODEL_PATH from /mnt/numa1 to /mnt/lustre01

  The compute-node-local NVMe path is not visible to the GHA runner host, so srtctl preflight (which runs there) failed with "model path unavailable". Use the Lustre copy instead so preflight resolves the alias to a path the runner can stat.

* Trim DSv4 GB200 dynamo-vLLM configs

* Fix perf changelog entry formatting

* Restore dynamic GB200 container import

---------

Co-authored-by: Oseltamivir <bryansg2013@gmail.com>
Co-authored-by: Bryan Shan <58582368+Oseltamivir@users.noreply.github.com>
Co-authored-by: Alec Flowers <aflowers@nvidia.com>
Co-authored-by: Alec <35311602+alec-flowers@users.noreply.github.com>
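The 1p8d change in the commit message above writes the same sweep twice: as the recipe's "x"-delimited concurrencies string and as the conc-list in nvidia-master.yaml. A small check along these lines could catch the two drifting apart. This is only a sketch; `parse_concurrencies` is a hypothetical helper, not part of srtctl or this repository, and the two values are copied from the commit message rather than read from the real files.

```python
def parse_concurrencies(spec: str) -> list[int]:
    """Split an 'x'-delimited concurrency string into a list of ints."""
    # int() raises ValueError on empty or non-numeric parts, e.g. "1xx8".
    values = [int(part) for part in spec.split("x")]
    if any(v <= 0 for v in values):
        raise ValueError(f"concurrencies must be positive: {spec!r}")
    return values


# Values quoted in the commit message above (assumed to mirror the files).
recipe_concurrencies = "1x8x16x32x64x128x256x512"
master_conc_list = [1, 8, 16, 32, 64, 128, 256, 512]

# Fails loudly if the recipe string and the master list disagree.
assert parse_concurrencies(recipe_concurrencies) == master_conc_list
```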

cquil11 requested review from functionstackx and kedarpotdar-nv October 1, 2025 17:58

cquil11 changed the title ~~chore: add 2, 4 tp to b200 runs for llama~~ bug fix: add TP=2,4 to B200, just as mi355 has Oct 1, 2025

adding tp 2 4 to b200 to match mi355 tp

c90e7b7

cquil11 force-pushed the b200-tp-add branch from c5cfead to c90e7b7 Compare October 1, 2025 18:33

cquil11 merged commit 1e97a5e into main Oct 1, 2025

cquil11 deleted the b200-tp-add branch October 1, 2025 18:35

cquil11 added the NVIDIA label Apr 8, 2026

cquil11 changed the title ~~bug fix: add TP=2,4 to B200, just as mi355 has~~ [NVIDIA] bug fix: add TP=2,4 to B200, just as mi355 has Apr 8, 2026

claude Bot mentioned this pull request Apr 26, 2026

[NV] dsv4-fp4-gb200-dynamo-vllm #1163

Merged

cquil11 merged 1 commit into main from b200-tp-add
