
[NVIDIA] Add Kimi K2.5 NVFP4 GB200 disaggregated vLLM benchmarks via Dynamo#1008

Merged
functionstackx merged 2 commits into main from nv/kimi2.5-disagg-gb200-8k1k-1k1k
Apr 7, 2026

Conversation

@nlevin-ui
Collaborator

- New framework: dynamo-vllm (Dynamo frontend + vLLM backend)
- 6 configs: 2x 1k1k + 4x 8k1k with varying prefill/decode ratios
- Recipes sourced from NVIDIA/srt-slurm branch sa-submission-q2-2026
- Runner script updated to clone NVIDIA/srt-slurm and map the vLLM container
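
For context, a single disaggregated entry along these lines might look as follows. This is a hypothetical sketch only: the entry name and key names are assumptions, not the actual nvidia-master.yaml schema; the values shown are taken from the first config verified later in this thread.

```yaml
# Hypothetical sketch of one disaggregated config entry; the entry name
# and all key names are assumptions, not the real nvidia-master.yaml schema.
kimik2.5-nvfp4-disagg-gb200-1p1d-1k1k:
  framework: dynamo-vllm
  prefill:
    num-worker: 1
    tp: 4
    ep: 4
    dp-attn: true
  decode:
    num-worker: 1
    tp: 16
    ep: 16
    dp-attn: true
  conc-list: [256, 512, 1024, 2048, 3072, 4096]
  config-file: 1k1k/disagg-gb200-1p1d-dep4-dep16.yaml
```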

Made-with: Cursor
@nlevin-ui nlevin-ui force-pushed the nv/kimi2.5-disagg-gb200-8k1k-1k1k branch from 4e4226e to 3e358a8 Compare April 6, 2026 23:34
@nlevin-ui nlevin-ui marked this pull request as ready for review April 6, 2026 23:41
@nlevin-ui nlevin-ui requested a review from a team April 6, 2026 23:41
@github-actions
Contributor

github-actions Bot commented Apr 6, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

@claude
Contributor

claude Bot commented Apr 6, 2026

Claude finished @nlevin-ui's task in 1m 19s


Review of PR #1008

  • Gather context and read changed files
  • Validate model prefixes in config keys
  • Validate container image accessibility
  • Validate enroot import in launch script
  • Validate perf-changelog updated
  • Check benchmark script style
  • Post review summary

LGTM — no blocking issues found.

Validation summary:

  • Container image: vllm/vllm-openai:v0.18.0-cu130 is publicly accessible on Docker Hub ✅
  • Enroot import: Script already has enroot import -o $SQUASH_FILE docker://$IMAGE at line 58 ✅
  • Perf-changelog: Updated with detailed entry ✅
  • Model prefix: kimik2.5 is consistent with existing Kimi K2.5 configs already in nvidia-master.yaml
  • Config structure: 6 disaggregated configs (2× 1k1k, 4× 8k1k) with varying prefill/decode ratios look well-structured ✅
  • Launch script: Clean branching for dynamo-vllm framework with correct srt-slurm repo (NVIDIA/srt-slurm) and branch (sa-submission-q2-2026) ✅
  • Container mapping: "${IMAGE}": ${SQUASH_FILE} at line 172 correctly maps the vLLM Docker image name so srt-slurm recipes can resolve it ✅

Comment thread on runners/launch_gb200-nv.sh (outdated)
Comment thread on .github/configs/nvidia-master.yaml:
The error now reports both model prefix and precision so users can identify which field is wrong.
Made-with: Cursor
@functionstackx
Contributor

thanks @nlevin-ui can u debug the failing pr validation?

@cquil11
Collaborator

cquil11 commented Apr 7, 2026

@claude Verify that the parameters (EP, DP, TP, etc) in nvidia-master.yaml match those in the linked recipes from CONFIG_FILES. For all added entries.

@Klaud-Cold
Collaborator

Klaud-Cold commented Apr 7, 2026

Claude finished @cquil11's task in 2m 18s


Parameter verification: nvidia-master.yaml vs upstream recipes

  • Read added entries in nvidia-master.yaml
  • Identify all 6 CONFIG_FILE recipe paths
  • Fetch each recipe from NVIDIA/srt-slurm branch sa-submission-q2-2026
  • Compare EP, DP, TP, num-worker, and concurrency parameters
  • Post verification summary

All 6 configs match their upstream recipes. I fetched each recipe YAML from NVIDIA/srt-slurm@sa-submission-q2-2026 and verified the parallelism parameters (num-worker, tp, ep, dp-attn) and concurrency lists against the nvidia-master.yaml entries.

The mapping used: tp ↔ gpus_per_prefill/gpus_per_decode, ep ↔ effective expert-parallel size (from data-parallel-size when using DEP, or from tensor-parallel-size when using TEP), num-worker ↔ prefill_workers/decode_workers, dp-attn ↔ whether data-parallel is active.
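
That mapping can be sketched in Python. This is an illustrative sketch only: the recipe field names (`workers`, `gpus_per_worker`, `tensor_parallel_size`, `data_parallel_size`) are assumptions, not the actual srt-slurm recipe schema.

```python
# Illustrative sketch of the recipe -> nvidia-master.yaml parameter mapping.
# Recipe field names here are assumptions, not the actual srt-slurm schema.
def effective_params(side):
    """Map one side (prefill or decode) of a recipe onto the
    num-worker / tp / ep / dp-attn fields used in nvidia-master.yaml."""
    dp = side.get("data_parallel_size", 1)
    tp = side.get("tensor_parallel_size", 1)
    return {
        "num-worker": side["workers"],
        "tp": side["gpus_per_worker"],  # GPUs per worker
        "ep": dp if dp > 1 else tp,     # DEP takes the DP size, TEP the TP size
        "dp-attn": dp > 1,              # data-parallel attention active?
    }

# Config 1 prefill: workers=1, gpus=4, TP=1, DP=4  ->  1 / 4 / 4 / true
prefill = {"workers": 1, "gpus_per_worker": 4,
           "tensor_parallel_size": 1, "data_parallel_size": 4}
print(effective_params(prefill))

# Config 2 decode: workers=4, gpus=4, TP=4, no DP  ->  4 / 4 / 4 / false
decode = {"workers": 4, "gpus_per_worker": 4, "tensor_parallel_size": 4}
print(effective_params(decode))
```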

1. 1k1k/disagg-gb200-1p1d-dep4-dep16.yaml

| Parameter | nvidia-master.yaml | Recipe |
| --- | --- | --- |
| Prefill num-worker / tp / ep / dp-attn | 1 / 4 / 4 / true | workers=1, gpus=4, TP=1, DP=4, EP=true |
| Decode num-worker / tp / ep / dp-attn | 1 / 16 / 16 / true | workers=1, gpus=16, TP=1, DP=16, EP=true |
| conc-list | [256, 512, 1024, 2048, 3072, 4096] | 256x512x1024x2048x3072x4096 |
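
The recipes appear to encode the concurrency sweep as an `x`-separated string, while nvidia-master.yaml uses a list. A conversion is a one-liner; note the `x` encoding is inferred from the comparison tables in this thread, not confirmed against the recipe schema.

```python
# Convert a recipe-style "x"-separated concurrency string into the list
# form used by nvidia-master.yaml. The "x" encoding is inferred from the
# comparison tables above, not confirmed against the actual recipe schema.
def parse_conc_list(s: str) -> list[int]:
    return [int(tok) for tok in s.split("x")]

print(parse_conc_list("256x512x1024x2048x3072x4096"))
# [256, 512, 1024, 2048, 3072, 4096]
```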

2. 1k1k/disagg-gb200-1p4d-dep4-tep4.yaml

| Parameter | nvidia-master.yaml | Recipe |
| --- | --- | --- |
| Prefill num-worker / tp / ep / dp-attn | 1 / 4 / 4 / true | workers=1, gpus=4, TP=1, DP=4, EP=true |
| Decode num-worker / tp / ep / dp-attn | 4 / 4 / 4 / false | workers=4, gpus=4, TP=4, EP=true (no DP) |
| conc-list | [4, 8, 16, 32, 64, 128] | 4x8x16x32x64x128 |

3. 8k1k/disagg-gb200-1p4d-dep4-tep4.yaml

| Parameter | nvidia-master.yaml | Recipe |
| --- | --- | --- |
| Prefill num-worker / tp / ep / dp-attn | 1 / 4 / 4 / true | workers=1, gpus=4, TP=1, DP=4, EP=true |
| Decode num-worker / tp / ep / dp-attn | 4 / 4 / 4 / false | workers=4, gpus=4, TP=4, EP=true (no DP) |
| conc-list | [4, 8, 16, 32, 128] | 4x8x16x32x128 |

Note: The 8k1k 1P4D concurrency list intentionally omits 64, matching the upstream recipe exactly. This confirms the recipe author's deliberate choice (as @nlevin-ui stated in the earlier review comment).

4. 8k1k/disagg-gb200-3p1d-dep4-dep16.yaml

| Parameter | nvidia-master.yaml | Recipe |
| --- | --- | --- |
| Prefill num-worker / tp / ep / dp-attn | 3 / 4 / 4 / true | workers=3, gpus=4, TP=1, DP=4, EP=true |
| Decode num-worker / tp / ep / dp-attn | 1 / 16 / 16 / true | workers=1, gpus=16, TP=1, DP=16, EP=true |
| conc-list | [512, 1024] | 512x1024 |

5. 8k1k/disagg-gb200-5p1d-dep4-dep8.yaml

| Parameter | nvidia-master.yaml | Recipe |
| --- | --- | --- |
| Prefill num-worker / tp / ep / dp-attn | 5 / 4 / 4 / true | workers=5, gpus=4, TP=1, DP=4, EP=true |
| Decode num-worker / tp / ep / dp-attn | 1 / 8 / 8 / true | workers=1, gpus=8, TP=1, DP=8, EP=true |
| conc-list | [2048] | 2048 |

6. 8k1k/disagg-gb200-6p1d-dep4-dep16.yaml

| Parameter | nvidia-master.yaml | Recipe |
| --- | --- | --- |
| Prefill num-worker / tp / ep / dp-attn | 6 / 4 / 4 / true | workers=6, gpus=4, TP=1, DP=4, EP=true |
| Decode num-worker / tp / ep / dp-attn | 1 / 16 / 16 / true | workers=1, gpus=16, TP=1, DP=16, EP=true |
| conc-list | [3072, 4096] | 3072x4096 |

Result: 6/6 configs verified — no mismatches found.



@nlevin-ui
Collaborator Author

> thanks @nlevin-ui can u debug the failing pr validation?

@functionstackx can you be more specific? I don't see PR validation failing beyond what I had already fixed prior to your comment. I might just be missing something obvious.

@functionstackx
Contributor

> thanks @nlevin-ui can u debug the failing pr validation?
>
> @functionstackx can you be more specific? I don't see PR validation failing beyond what I had already fixed prior to your comment. I might just be missing something obvious.

nvm, i see that attempt #2 re-run has fixed it https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24058086986?pr=1008 thanks guys!

@functionstackx functionstackx left a comment

@functionstackx functionstackx merged commit 235b232 into main Apr 7, 2026
53 of 54 checks passed
@functionstackx functionstackx deleted the nv/kimi2.5-disagg-gb200-8k1k-1k1k branch April 7, 2026 19:33
@cquil11 cquil11 changed the title Add Kimi K2.5 NVFP4 GB200 disaggregated vLLM benchmarks via Dynamo [NVIDIA] Add Kimi K2.5 NVFP4 GB200 disaggregated vLLM benchmarks via Dynamo Apr 8, 2026
