
[NV] H100 FP8 Disagg DSR1 1k1k, 8k1k (STP + MTP) #651

Merged
kedarpotdar-nv merged 12 commits into main from nv/h100-disagg-trtllm-fp8 on Feb 7, 2026

Conversation

@nlevin-ui (Collaborator)

Update H100 section with verified Pareto-optimal points:

1k1k ISL/OSL:

  • MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
  • STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)

8k1k ISL/OSL (new):

  • MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
  • STP: 5 configs (conc: 6, 9, 30, 154, 308)

All configs use TP=16, EP=16 across 2 nodes (16 GPUs total).
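
For readers who have not opened nvidia-master.yaml, a minimal sketch of how one of these points might be expressed is shown below. This is illustrative only: apart from tp, multinode, disagg, and the prefill/decode worker fields discussed later in this thread, the key names (scenarios, isl/osl, conc, ep) are assumptions, not the repo's actual schema.

dsr1-fp8-h100-dynamo-trt:          # new config key added by this PR
  multinode: true
  disagg: true
  scenarios:                       # "scenarios", "isl"/"osl", "conc", and "ep" key names are hypothetical
    - isl: 1024                    # a 1k1k ISL/OSL scenario (token counts shown here are illustrative)
      osl: 1024
      conc: 6                      # lowest 1k1k MTP Pareto point from the list above
      tp: 16
      ep: 16                       # EP=16 per the description; the actual EP key name is not shown in this thread
      prefill:
        num-worker: 1              # low-concurrency points use 1 prefill + 3 decode workers per the review notes below
      decode:
        num-worker: 3
        dp-attn: false             # TEP topology at low concurrency; DEP (dp-attn: true) at higher concurrency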

@functionstackx (Contributor)

@claude review

@Klaud-Cold (Collaborator) commented Feb 5, 2026

Claude finished @functionstackx's task in 2m 51s


Review: PR #651 — H100 FP8 Disagg DSR1 Configs

  • Gather context and read changed files
  • Review YAML structure and consistency
  • Check configuration values and parameters
  • Post review feedback

Summary

This PR adds a new dsr1-fp8-h100-dynamo-trt config key with 29 Pareto-optimal disaggregated inference configurations for H100 multinode (TP=16, EP=16 across 2 nodes). The structure mirrors the existing dsr1-fp8-h200-dynamo-trt config.


Findings

Issue: Concurrency ordering is non-monotonic in 1k1k MTP section

The 1k1k MTP configs have conc=615 before conc=462 (lines 1664-1693), breaking the ascending order used everywhere else in this file and in the H200 reference config.

Current order: 6, 9, 30, 60, 117, 231, 615, 462, 1229
Expected order: 6, 9, 30, 60, 117, 231, 462, 615, 1229

The PR description lists them in the correct sorted order (462, 615), so this appears to be a copy/paste ordering issue in the YAML. While it may not affect functionality (configs are likely processed independently), it hurts readability and deviates from the consistent ascending pattern used in every other section.

Fix this →

Note: Missing perf-changelog.yaml entry

This PR adds an entirely new config key (dsr1-fp8-h100-dynamo-trt) but does not add a corresponding entry to perf-changelog.yaml. Per repo conventions, new config keys that affect performance benchmarks should be documented there. An entry like this would be appropriate:

- config-keys:
    - dsr1-fp8-h100-dynamo-trt
  description:
    - "Add H100 FP8 Disagg DSR1 Dynamo TRT-LLM multinode configurations"
    - "Image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1"
    - "1k1k: 18 scenarios (9 MTP, 9 STP) with verified Pareto-optimal points"
    - "8k1k: 11 scenarios (6 MTP, 5 STP) for long context workloads"
    - "TP=16, EP=16 across 2 nodes (16 GPUs total)"
  pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/651

Fix this →


Cross-validation Results (all pass)

All of the following checks passed:

  • YAML validity
  • Config count matches PR description (9+9+6+5 = 29)
  • CONFIG_FILE paths match ISL/OSL (1k1k/8k1k)
  • CONFIG_FILE paths match MTP/STP (mtp/stp subdirectories, mtp3 vs mtp0)
  • ctx{N} in filenames matches prefill.num-worker
  • gen{N} in filenames matches decode.num-worker
  • tep/dep in filenames matches decode.dp-attn
  • TP size in filenames matches config tp: 16
  • Image tag consistent with H200 config (0.8.1.post1)
  • multinode: true and disagg: true present
  • Runner set to h100-multinode-slurm
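
Put differently, each scenario's CONFIG_FILE name redundantly encodes the topology, and the checks above confirm that the encoded values agree with the YAML fields. A hypothetical example follows (the path layout and key nesting are assumed for illustration, not copied from the diff):

CONFIG_FILE: 1k1k/mtp/ctx1_gen3_tep.yaml   # hypothetical path: 1k1k ISL/OSL, mtp subdirectory, 1 ctx worker, 3 gen workers, TEP
prefill:
  num-worker: 1      # must equal the N in ctx{N}
decode:
  num-worker: 3      # must equal the N in gen{N}
  dp-attn: false     # tep in the filename implies dp-attn: false; dep implies dp-attn: true
tp: 16               # must match the TP size encoded in the filename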

Architecture observations

  • Worker ratios vary sensibly with concurrency: low-conc uses 1P+3D (prefill-light), high-conc shifts to 2P+1D or 1P+1D (decode-light)
  • dp-attn transitions from false to true on decode workers as concurrency increases and topology shifts from TEP to DEP — consistent with the recipe naming (see the sketch below)
  • 8k1k MTP conc=77 and conc=78 are distinct Pareto points with different topologies (1P+1D DEP vs 1P+2D TEP) despite near-identical concurrency values — this is correct
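
To make the dp-attn transition concrete, the decode block of a low-concurrency and a high-concurrency scenario might differ roughly as follows. A sketch only, using the worker ratios from the first bullet above; surrounding key names are assumed.

# Low-concurrency scenario (e.g. conc=6): prefill-light 1P+3D, TEP decode
decode:
  num-worker: 3
  dp-attn: false

# High-concurrency scenario (e.g. conc=1229): 1P+1D, DEP decode
decode:
  num-worker: 1
  dp-attn: true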

Branch: nv/h100-disagg-trtllm-fp8

@functionstackx (Contributor)

hi @nlevin-ui @kedarpotdar-nv

I am removing sweep-enabled since the PR isn't fixed yet. Feel free to re-add the tag when it is ready.

@functionstackx (Contributor)

@pr-claude review plz

@claude (bot) commented Feb 5, 2026

Claude finished @functionstackx's task in 2m 34s


Review: PR #651 — H100 FP8 Disagg DSR1 Configs

  • Gather context and read changed files
  • Validate YAML structure and config keys
  • Validate container images and launch script
  • Check perf-changelog.yaml
  • Post review feedback

Summary

Found two blocking bugs in the launch script that will prevent any benchmark from running, plus a missing changelog entry and two warnings (one carried over from the previous review).

Blocking Issues

1. 🔴 BLOCKING (runners/launch_h100-dgxc-slurm.sh:44): MODEL_PREFIX check compares against "DeepSeek-R1-0528" but the workflow passes "dsr1" (from model-prefix: dsr1). Every job will exit immediately. All other launch scripts check the short prefix.
2. 🔴 BLOCKING (runners/launch_h100-dgxc-slurm.sh:39): Hardcoded squash file path (dynamo-trtllm-mnnvl-fix.sqsh) — not derived from IMAGE and no enroot import docker:// step. Not reproducible outside the specific cluster.
3. 🔴 BLOCKING (perf-changelog.yaml): nvidia-master.yaml adds a new config key (dsr1-fp8-h100-dynamo-trt) but perf-changelog.yaml was not updated. Per repo conventions, master config changes require a changelog entry.

Warnings

4. 🟡 WARNING (runners/launch_h100-dgxc-slurm.sh:14): Checks out trtllm-h100 branch but all YAML recipe comments reference sa-submission-q1-2026 branch — mismatch needs clarification.
5. 🟡 WARNING (.github/configs/nvidia-master.yaml:1664-1693): 1k1k MTP concurrency ordering is non-monotonic (615 before 462) — already flagged in previous review.

What looks good

  • Config key dsr1-fp8-h100-dynamo-trt uses valid model prefix (dsr1)
  • Image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post1 is publicly accessible on NGC
  • Config count matches PR description (9+9+6+5 = 29)
  • Worker ratios and topology transitions are sensible across concurrency levels
  • YAML structure mirrors the existing H200 config correctly

Branch: nv/h100-disagg-trtllm-fp8

nlevin-ui and others added 9 commits February 6, 2026 22:21
  • Update H100 section with verified Pareto-optimal points:
    1k1k ISL/OSL:
    - MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
    - STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)
    8k1k ISL/OSL (new):
    - MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
    - STP: 5 configs (conc: 6, 9, 30, 154, 308)
    All configs use TP=16, EP=16 across 2 nodes (16 GPUs total).
    Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

  • Use consistent sed-based path derivation for all container images.
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

  • Only dsr1-fp8-h100-dynamo-trt should use 0.8.1.post3. Revert B200, B300, and H200 configs back to 0.8.1.post1.
    Co-authored-by: Cursor <cursoragent@cursor.com>

  • Documents the new H100 FP8 disaggregated TRT-LLM configurations using tensorrtllm-runtime:0.8.1.post3 container.
    Co-authored-by: Cursor <cursoragent@cursor.com>
@nlevin-ui force-pushed the nv/h100-disagg-trtllm-fp8 branch from b7a3532 to 3911118 on February 6, 2026 22:22
@nlevin-ui marked this pull request as ready for review on February 6, 2026 22:31
@nlevin-ui requested a review from a team as a code owner on February 6, 2026 22:31
@claude (bot) commented Feb 6, 2026

Claude finished @nlevin-ui's task in 5m 7s


Re-review: PR #651 — H100 FP8 Disagg DSR1 Configs

  • Gather context and read changed files
  • Check previous review findings against current code
  • Validate launch script and configs
  • Post review feedback

Previous issues resolved

  • Concurrency ordering (462/615) — Fixed. Now in correct ascending order.
  • Hardcoded squash file — Fixed. SQUASH_FILE is now derived from IMAGE.
  • Branch mismatch — Fixed. H100 script now checks out sa-submission-q1-2026, matching YAML comments.
  • Missing perf-changelog entry — Fixed. Entry added for dsr1-fp8-h100-dynamo-trt.

Remaining issues

1. 🔴 BLOCKING (nvidia-master.yaml:1577): model-prefix: DeepSeek-R1-0528 — invalid model prefix. Every other DSR1 config uses dsr1. The frontend only supports dsr1 and gptoss prefixes. Results will not display.
2. 🔴 BLOCKING (launch_h100-dgxc-slurm.sh:40): MODEL_PREFIX check matches "DeepSeek-R1-0528" — must be updated to "dsr1" after fixing issue #1, otherwise all jobs exit immediately.
3. 🟡 WARNING (launch_h100-dgxc-slurm.sh:38): No enroot import docker://$IMAGE step. The B200 SLURM script includes this; the H200 SLURM script also omits it. May be a cluster-specific difference but reduces reproducibility.

Issues #1 and #2 are coupled — the YAML model-prefix and launch script check must agree. The fix is: change model-prefix to dsr1 in the YAML, and change the launch script check to "dsr1".


Branch: nv/h100-disagg-trtllm-fp8

dsr1-fp8-h100-dynamo-trt:
  image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3
  model: deepseek-ai/DeepSeek-R1-0528
  model-prefix: DeepSeek-R1-0528

🔴 BLOCKING: Invalid model prefix DeepSeek-R1-0528. The frontend only supports dsr1 and gptoss as model prefixes. Every other DeepSeek config in this file uses model-prefix: dsr1 (e.g., dsr1-fp8-h200-dynamo-trt at line 1092). Using DeepSeek-R1-0528 will prevent the frontend from displaying benchmark results for this config.

Why it matters: Results will be collected but won't appear on the frontend dashboard, making this config effectively invisible to users.

Fix: Change the model-prefix to dsr1 (matching H200 and all other DSR1 configs), and update the launch script's MODEL_PREFIX check accordingly:

Suggested change:
- model-prefix: DeepSeek-R1-0528
+ model-prefix: dsr1

Comment on lines +40 to +44
if [[ $MODEL_PREFIX == "DeepSeek-R1-0528" ]]; then
  export MODEL_PATH="/mnt/numa1/shared/models/dsr1-fp8"
  export SERVED_MODEL_NAME="DeepSeek-R1-0528"
else
  echo "Unsupported model prefix: $MODEL_PREFIX. Supported prefixes are: DeepSeek-R1-0528"

🔴 BLOCKING: Must check for "dsr1" (not "DeepSeek-R1-0528") to be consistent with the corrected model-prefix in the YAML config and match the convention used by all other launch scripts (launch_h200-dgxc-slurm.sh, launch_b300-nv.sh, etc.).

Why it matters: After fixing the YAML model-prefix to dsr1, this check will fail and every job will exit with error code 1.

Fix:

Suggested change:
- if [[ $MODEL_PREFIX == "DeepSeek-R1-0528" ]]; then
+ if [[ $MODEL_PREFIX == "dsr1" ]]; then
    export MODEL_PATH="/mnt/numa1/shared/models/dsr1-fp8"
    export SERVED_MODEL_NAME="DeepSeek-R1-0528"
  else
-   echo "Unsupported model prefix: $MODEL_PREFIX. Supported prefixes are: DeepSeek-R1-0528"
+   echo "Unsupported model prefix: $MODEL_PREFIX. Supported prefixes are: dsr1"

nlevin-ui and others added 2 commits February 6, 2026 15:48
Change SQUASH_FILE path from /mnt/nfs/slurm-shared/containers/ to
/mnt/nfs/sa-shared/containers/ to match cluster configuration.

Co-authored-by: Cursor <cursoragent@cursor.com>

@functionstackx (Contributor) left a comment

LGTM, feel free to merge if the single sweep error is transient

@kedarpotdar-nv changed the title from "[WIP - DRAFT ] [NV] H100 FP8 Disagg DSR1 1k1k, 8k1k (STP + MTP)" to "[NV] H100 FP8 Disagg DSR1 1k1k, 8k1k (STP + MTP)" on Feb 7, 2026
…trt changelog to end

Co-authored-by: Cursor <cursoragent@cursor.com>
@kedarpotdar-nv merged commit 32a4845 into main on Feb 7, 2026
10 of 40 checks passed
@kedarpotdar-nv deleted the nv/h100-disagg-trtllm-fp8 branch on February 7, 2026 22:28
@functionstackx restored the nv/h100-disagg-trtllm-fp8 branch on February 8, 2026 00:45
@functionstackx deleted the nv/h100-disagg-trtllm-fp8 branch on February 18, 2026 04:16
@claude (bot) mentioned this pull request on Apr 23, 2026