[NV] H100 FP8 Disagg DSR1 1k1k, 8k1k (STP + MTP) #651
Conversation
@claude review
Claude finished @functionstackx's task in 2m 51s
Review: PR #651 — H100 FP8 Disagg DSR1 Configs
Summary
This PR adds a new dsr1-fp8-h100-dynamo-trt config covering 1k1k and 8k1k ISL/OSL sweeps in MTP and STP variants.

Findings
Issue: Concurrency ordering is non-monotonic in the 1k1k MTP section. The 1k1k MTP configs have conc=615 before conc=462 (lines 1664-1693), breaking the ascending order used everywhere else in this file and in the H200 reference config. The PR description lists them in the correct sorted order (6, 9, 30, 60, 117, 231, 462, 615, 1229).
Note: a changelog entry for the new config is missing.
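A minimal sketch of how the ordering could be checked, assuming each benchmark entry in the recipe YAML carries a `concurrency:` key; the key name and the file path `recipes/perf.yaml` are assumptions, not taken from this PR:

```bash
# Hypothetical sketch: check that the concurrency values pulled from a recipe
# section are in ascending order. The "concurrency:" key name and the file
# path are assumptions, not taken from this PR.
grep -oP 'concurrency:\s*\K[0-9]+' recipes/perf.yaml | sort -nc \
  && echo "concurrency values are in ascending order" \
  || echo "ordering violation (sort -c reports the first out-of-order value)"
```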
| Check | Result |
|---|---|
| YAML validity | Valid |
| Config count matches PR description (9+9+6+5=29) | Pass |
| CONFIG_FILE paths match ISL/OSL (1k1k/8k1k) | Pass |
| CONFIG_FILE paths match MTP/STP (mtp/stp subdirectories, mtp3 vs mtp0) | Pass |
| ctx{N} in filenames matches prefill.num-worker | Pass |
| gen{N} in filenames matches decode.num-worker | Pass |
| tep/dep in filenames matches decode.dp-attn | Pass |
| TP size in filenames matches config tp: 16 | Pass |
| Image tag consistent with H200 config (0.8.1.post1) | Pass |
| multinode: true and disagg: true present | Pass |
| Runner set to h100-multinode-slurm | Pass |
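A minimal sketch of the filename checks summarized in the table, assuming the CONFIG_FILE paths live in a recipe YAML at `recipes/perf.yaml`; the path and the field names are assumptions, not taken from this PR:

```bash
# Hypothetical sketch: extract the ctx{N}, gen{N}, and tep/dep tokens from each
# CONFIG_FILE path so they can be compared against prefill.num-worker,
# decode.num-worker, and decode.dp-attn in the recipe. The file path and field
# names are assumptions.
grep -oP 'CONFIG_FILE=\K\S+' recipes/perf.yaml | while read -r cfg; do
  ctx=$(grep -oP 'ctx\K[0-9]+' <<<"$cfg")
  gen=$(grep -oP 'gen\K[0-9]+' <<<"$cfg")
  topo=$(grep -oEm1 'tep|dep' <<<"$cfg")
  echo "$cfg -> prefill workers: $ctx, decode workers: $gen, attention layout: $topo"
done
```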
Architecture observations
- Worker ratios vary sensibly with concurrency: low-conc uses 1P+3D (prefill-light), high-conc shifts to 2P+1D or 1P+1D (decode-light)
- dp-attn transitions from false to true on decode workers as concurrency increases and topology shifts from TEP to DEP — consistent with the recipe naming (see the sketch after this list)
- 8k1k MTP conc=77 and conc=78 are distinct Pareto points with different topologies (1P+1D DEP vs 1P+2D TEP) despite near-identical concurrency values — this is correct
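A minimal sketch of how that TEP-to-DEP transition could be eyeballed, assuming the recipe keeps dp-attn and worker counts as plain YAML keys; the key names and path are assumptions:

```bash
# Hypothetical sketch: print concurrency, dp-attn, and worker counts side by
# side to see the TEP -> DEP transition described above. Key names and the
# recipe path are assumptions.
grep -nE 'concurrency:|dp-attn:|num-worker:' recipes/perf.yaml
```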
I am removing sweep-enabled since the PR isn't fixed yet. Feel free to re-add the tag when it is ready.
@pr-claude review plz
Claude finished @functionstackx's task in 2m 34s
Review: PR #651 — H100 FP8 Disagg DSR1 Configs
Summary
Found 2 blocking bugs in the launch script that will prevent any benchmark from running, plus 1 warning and 1 missing changelog entry.

Blocking Issues
Warnings
What looks good
Update H100 section with verified Pareto-optimal points:

1k1k ISL/OSL:
- MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
- STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)

8k1k ISL/OSL (new):
- MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
- STP: 5 configs (conc: 6, 9, 30, 154, 308)

All configs use TP=16, EP=16 across 2 nodes (16 GPUs total).
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Use consistent sed-based path derivation for all container images.
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
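A minimal sketch of what the sed-based path derivation mentioned in the commit above could look like, assuming the squash file name is the image reference with separators flattened; the flattening rule and the container directory are assumptions, not taken from the actual launch script:

```bash
# Hypothetical sketch: derive the squash-file path from the container image
# reference by flattening "/" and ":" with sed. The flattening rule and the
# container directory are assumptions, not taken from the actual launch script.
IMAGE="nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3"
CONTAINER_DIR="/mnt/nfs/sa-shared/containers"
SQUASH_FILE="${CONTAINER_DIR}/$(echo "$IMAGE" | sed 's|[/:]|.|g').sqsh"
echo "$SQUASH_FILE"
# -> /mnt/nfs/sa-shared/containers/nvcr.io.nvidia.ai-dynamo.tensorrtllm-runtime.0.8.1.post3.sqsh
```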
Only dsr1-fp8-h100-dynamo-trt should use 0.8.1.post3. Revert B200, B300, and H200 configs back to 0.8.1.post1.
Co-authored-by: Cursor <cursoragent@cursor.com>
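A quick spot-check for the image pinning described in the commit above; the recipe file path is an assumption:

```bash
# Hypothetical sketch: list every image tag reference so it is easy to confirm
# that only dsr1-fp8-h100-dynamo-trt uses 0.8.1.post3 and that the B200, B300,
# and H200 configs are back on 0.8.1.post1. The recipe path is an assumption.
grep -n 'tensorrtllm-runtime:0\.8\.1\.post' recipes/perf.yaml
```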
Documents the new H100 FP8 disaggregated TRT-LLM configurations using the tensorrtllm-runtime:0.8.1.post3 container.
Co-authored-by: Cursor <cursoragent@cursor.com>
Force-pushed from b7a3532 to 3911118
Claude finished @nlevin-ui's task in 5m 7s
Re-review: PR #651 — H100 FP8 Disagg DSR1 Configs
Previous issues resolved
Remaining issues
Issues #1 and #2 are coupled — the YAML model-prefix and launch script check must agree. The fix is: change model-prefix to dsr1 and update the launch script's MODEL_PREFIX check to match.
```yaml
dsr1-fp8-h100-dynamo-trt:
  image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1.post3
  model: deepseek-ai/DeepSeek-R1-0528
  model-prefix: DeepSeek-R1-0528
```
🔴 BLOCKING: Invalid model prefix DeepSeek-R1-0528. The frontend only supports dsr1 and gptoss as model prefixes. Every other DeepSeek config in this file uses model-prefix: dsr1 (e.g., dsr1-fp8-h200-dynamo-trt at line 1092). Using DeepSeek-R1-0528 will prevent the frontend from displaying benchmark results for this config.
Why it matters: Results will be collected but won't appear on the frontend dashboard, making this config effectively invisible to users.
Fix: Change the model-prefix to dsr1 (matching H200 and all other DSR1 configs), and update the launch script's MODEL_PREFIX check accordingly:
```diff
-  model-prefix: DeepSeek-R1-0528
+  model-prefix: dsr1
```
```bash
if [[ $MODEL_PREFIX == "DeepSeek-R1-0528" ]]; then
    export MODEL_PATH="/mnt/numa1/shared/models/dsr1-fp8"
    export SERVED_MODEL_NAME="DeepSeek-R1-0528"
else
    echo "Unsupported model prefix: $MODEL_PREFIX. Supported prefixes are: DeepSeek-R1-0528"
```
🔴 BLOCKING: Must check for "dsr1" (not "DeepSeek-R1-0528") to be consistent with the corrected model-prefix in the YAML config and match the convention used by all other launch scripts (launch_h200-dgxc-slurm.sh, launch_b300-nv.sh, etc.).
Why it matters: After fixing the YAML model-prefix to dsr1, this check will fail and every job will exit with error code 1.
Fix:
```diff
-if [[ $MODEL_PREFIX == "DeepSeek-R1-0528" ]]; then
-    export MODEL_PATH="/mnt/numa1/shared/models/dsr1-fp8"
-    export SERVED_MODEL_NAME="DeepSeek-R1-0528"
-else
-    echo "Unsupported model prefix: $MODEL_PREFIX. Supported prefixes are: DeepSeek-R1-0528"
+if [[ $MODEL_PREFIX == "dsr1" ]]; then
+    export MODEL_PATH="/mnt/numa1/shared/models/dsr1-fp8"
+    export SERVED_MODEL_NAME="DeepSeek-R1-0528"
+else
+    echo "Unsupported model prefix: $MODEL_PREFIX. Supported prefixes are: dsr1"
```
Change SQUASH_FILE path from /mnt/nfs/slurm-shared/containers/ to /mnt/nfs/sa-shared/containers/ to match cluster configuration.
Co-authored-by: Cursor <cursoragent@cursor.com>
functionstackx left a comment
LGTM, feel free to merge if the single sweep error is transient
…trt changelog to end
Co-authored-by: Cursor <cursoragent@cursor.com>
Update H100 section with verified Pareto-optimal points:
1k1k ISL/OSL:
- MTP: 9 configs (conc: 6, 9, 30, 60, 117, 231, 462, 615, 1229)
- STP: 9 configs (conc: 6, 9, 30, 60, 231, 462, 924, 1845, 4916)

8k1k ISL/OSL (new):
- MTP: 6 configs (conc: 6, 9, 30, 77, 78, 154)
- STP: 5 configs (conc: 6, 9, 30, 154, 308)
All configs use TP=16, EP=16 across 2 nodes (16 GPUs total).