[NV] [DoNotMerge] Add DSV4-pro GB300 vLLM recipes #1238
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```yaml
dsv4-fp4-gb300-dynamo-vllm:
  image: vllm/vllm-openai:v0.20.0-ubuntu2404
  model: deepseek-ai/DeepSeek-V4-Pro
  model-prefix: dsv4
  runner: gb300-cw
  precision: fp4
  framework: dynamo-vllm
```
🔴 The new dsv4-fp4-gb300-dynamo-vllm config (.github/configs/nvidia-master.yaml:7752-7842) sets runner: gb300-cw + framework: dynamo-vllm, but runners/launch_gb300-cw.sh only accepts dynamo-sglang for dsv4/fp4 and exits 1 otherwise — every concurrency point in this PR will hard-fail at startup before any srtctl submission. The launcher needs a dynamo-vllm branch (mirroring the existing one in runners/launch_gb200-nv.sh:47 and :149) plus the recipe overlay logic for the new recipes/vllm/deepseek-v4/8k1k/disagg-gb300-*.yaml paths.
Extended reasoning...
What the bug is
The new dsv4-fp4-gb300-dynamo-vllm config added in this PR (.github/configs/nvidia-master.yaml:7752-7842) sets runner: gb300-cw and framework: dynamo-vllm. CI dispatches based on the runner stem: benchmark-multinode-tmpl.yml:177 runs `bash ./runners/launch_${RUNNER_NAME%%_*}.sh`, so any gb300-cw_N runner routes to runners/launch_gb300-cw.sh. That launcher's top gate is:

```bash
if [[ $FRAMEWORK == "dynamo-sglang" && $MODEL_PREFIX == "dsv4" && $PRECISION == "fp4" ]]; then
    export MODEL_PATH="/mnt/vast/models/dsv4/"
else
    echo "Unsupported model prefix/precision/framework combination on gb300-cw: $MODEL_PREFIX/$PRECISION/$FRAMEWORK. Currently supported: dsv4/fp4/dynamo-sglang"
    exit 1
fi
```

With FRAMEWORK=dynamo-vllm, the if-clause is false and the script exits 1 before any srtctl submission. The else branch even spells out the only supported combination as dsv4/fp4/dynamo-sglang.
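The failing evaluation can be checked in isolation with plain bash; this is a minimal sketch that only reuses the gate quoted above and the env values resolved from the new config, not the full launcher:

```bash
#!/usr/bin/env bash
# Values resolved from the new dsv4-fp4-gb300-dynamo-vllm config entry.
FRAMEWORK=dynamo-vllm
MODEL_PREFIX=dsv4
PRECISION=fp4

# Same conditional as the launcher's top gate: it is false for dynamo-vllm,
# so launch_gb300-cw.sh would fall through to the else branch and exit 1.
if [[ $FRAMEWORK == "dynamo-sglang" && $MODEL_PREFIX == "dsv4" && $PRECISION == "fp4" ]]; then
    echo "gate passes"
else
    echo "gate fails for $MODEL_PREFIX/$PRECISION/$FRAMEWORK"   # this branch runs
fi
```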
Step-by-step proof
- CI matrix resolves runner: gb300-cw → RUNNER_NAME=gb300-cw_N (e.g. gb300-cw_0). benchmark-multinode-tmpl.yml:177 invokes `bash ./runners/launch_${RUNNER_NAME%%_*}.sh`, which expands to `bash ./runners/launch_gb300-cw.sh` (see the expansion sketch after this list).
- Env: FRAMEWORK=dynamo-vllm, MODEL_PREFIX=dsv4, PRECISION=fp4. launch_gb300-cw.sh:10 evaluates `[[ "dynamo-vllm" == "dynamo-sglang" && ... ]]` → false.
- The else at lines 15-17 prints "Unsupported model prefix/precision/framework combination on gb300-cw: dsv4/fp4/dynamo-vllm. Currently supported: dsv4/fp4/dynamo-sglang" and runs exit 1.
- Every one of the 6 search-space entries (concurrencies 192/256, 18/36/72, 4096 ×3, 2048/3072) hits this same path and dies before srtctl is invoked.
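The `%%_*` expansion in the first step is plain bash suffix stripping; a quick sketch, using the runner names from this PR's matrix:

```bash
#!/usr/bin/env bash
# ${RUNNER_NAME%%_*} removes the longest suffix starting at the first "_",
# so every gb300-cw_N runner maps to the same launcher script.
for RUNNER_NAME in gb300-cw_0 gb300-cw_1 gb300-cw_5; do
    echo "RUNNER_NAME=$RUNNER_NAME -> ./runners/launch_${RUNNER_NAME%%_*}.sh"
done
# Prints ./runners/launch_gb300-cw.sh for all three.
```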
Why nothing else saves it
- The companion runners/launch_gb300-nv.sh only supports dsr1/fp4 and dsr1/fp8 (lines 13-24, with the explicit error "Supported models are: dsr1-fp4, dsr1-fp8"), so swapping to runner: gb300 would also fail.
- A grep for dynamo-vllm in launch_gb300-cw.sh returns only the comment block at lines 4-5 referencing PR #1150 (vLLM GB300 Day 0 DSV4 FP4 disagg); there is no executable dynamo-vllm branch in this PR.
- The recipe overlay block at launch_gb300-cw.sh:113-114 only copies recipes/sglang/deepseek-v4. Even if the framework gate were patched, the new config's CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb300-*.yaml paths would not exist in the upstream srt-slurm checkout, so srtctl apply would fail at YAML resolution.
Fix
Mirror the two-branch pattern already present in runners/launch_gb200-nv.sh:
- Add a dynamo-vllm + dsv4 + fp4 branch in the top gating block to set MODEL_PATH / SRT_SLURM_MODEL_PREFIX (analogous to launch_gb200-nv.sh:47-60).
- Add a parallel branch in the recipe overlay block that clones the right NVIDIA/srt-slurm ref and overlays benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4 into recipes/vllm/deepseek-v4 (analogous to launch_gb200-nv.sh:149-158). A rough sketch of both branches follows below.
Until both pieces land, the entire newly-added benchmark cannot run.
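A rough sketch of what the two branches could look like in runners/launch_gb300-cw.sh. This is illustrative only: it mirrors the dynamo-sglang gate quoted above and the gb200-nv pattern described in the Fix list; the MODEL_PATH value, the SRT_SLURM_MODEL_PREFIX handling, and the overlay source path are assumptions the PR author would need to confirm against launch_gb200-nv.sh:

```bash
# --- Top gating block: add a dynamo-vllm branch next to the existing one ---
if [[ $FRAMEWORK == "dynamo-sglang" && $MODEL_PREFIX == "dsv4" && $PRECISION == "fp4" ]]; then
    export MODEL_PATH="/mnt/vast/models/dsv4/"
elif [[ $FRAMEWORK == "dynamo-vllm" && $MODEL_PREFIX == "dsv4" && $PRECISION == "fp4" ]]; then
    # Assumption: the vLLM recipes load weights from the same shared path;
    # adjust MODEL_PATH / SRT_SLURM_MODEL_PREFIX to whatever gb200-nv uses.
    export MODEL_PATH="/mnt/vast/models/dsv4/"
else
    echo "Unsupported model prefix/precision/framework combination on gb300-cw: $MODEL_PREFIX/$PRECISION/$FRAMEWORK"
    exit 1
fi

# --- Recipe overlay block: also stage the vLLM recipes, not just sglang ---
if [[ $FRAMEWORK == "dynamo-vllm" ]]; then
    # Hypothetical paths, following the benchmarks/multi_node/srt-slurm-recipes
    # layout referenced in the Fix list; the srt-slurm ref must match gb200-nv.
    mkdir -p recipes/vllm/deepseek-v4
    cp -r benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/. recipes/vllm/deepseek-v4/
fi
```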
```yaml
stream-interval: 50
no-disable-hybrid-kv-cache-manager: true
```
🟡 The decode workers in disagg-gb300-1p6d-dep4-tp4.yaml (lines ~99-100) and disagg-gb300-1p17d-tep4-tp4.yaml (lines ~91-92) are plain TP=4 (no data-parallel-size, no enable-expert-parallel), yet still set the EP-only flags enable-ep-weight-filter: true and all2all-backend: flashinfer_nvlink_one_sided. This is inconsistent with every sibling recipe in this directory — every other 8k1k decode that sets these flags also sets enable-expert-parallel: true (see GB200 mid/high/max-tpt-megamoe and the new 7p2d). Looks like a copy-paste from the prefill block; please drop both flags from the decode in 1p6d and 1p17d (the prefill in 1p6d/1p17d should keep them since they are DEP=4 / TEP=4 with EP enabled).
Extended reasoning...
What the bug is
In disagg-gb300-1p6d-dep4-tp4.yaml and disagg-gb300-1p17d-tep4-tp4.yaml, the decode worker is plain TP=4: it sets tensor-parallel-size: 4 with no data-parallel-size and no enable-expert-parallel. The matching nvidia-master.yaml entries also describe these decode workers as ep: 1, dp-attn: false. But each decode block also sets two expert-parallelism-only knobs:
```yaml
enable-ep-weight-filter: true
all2all-backend: "flashinfer_nvlink_one_sided"
```

enable-ep-weight-filter only has meaning when the model is sharded across EP ranks (it filters the per-expert weights at load time according to the EP assignment). all2all-backend only has meaning when an EP all2all collective is being constructed. With EP off, neither flag has any input to act on.
Why it is wrong
Every other 8k1k recipe in this directory follows a strict pattern: these two flags are set only on workers that also set enable-expert-parallel: true.
- GB200 plain-TP decode (no EP): disagg-gb200-low-latency.yaml (decode lines 116-134) and disagg-gb200-low-middle-curve.yaml (decode lines 118-136) explicitly comment out data-parallel-size and enable-expert-parallel, and do not set enable-ep-weight-filter or all2all-backend.
- GB200 EP decode: disagg-gb200-mid-curve-megamoe.yaml, high-tpt-megamoe.yaml, and max-tpt-megamoe.yaml all set enable-expert-parallel: true together with enable-ep-weight-filter: true.
- GB300 7p2d decode (this same PR): genuinely DEP=16 (data-parallel-size: 16 + enable-expert-parallel: true), so the flags belong there.
- GB300 4p1d/5p1d/6p1d c4096 decodes (this same PR): DEP=8 with enable-expert-parallel: true, so they correctly set enable-ep-weight-filter: true (no all2all-backend).
Only the 1p6d and 1p17d decodes break the pattern. The prefill blocks in these same files are DEP=4 (1p6d) and TEP=4 (1p17d) and do correctly enable EP — these flags clearly belong to the prefill, and the decode block reads like a copy-paste from prefill that forgot to drop the EP-only knobs.
Step-by-step proof (1p6d)
- disagg-gb300-1p6d-dep4-tp4.yaml decode block (around lines 84-105):
  - tensor-parallel-size: 4
  - no data-parallel-size
  - no enable-expert-parallel
  - enable-ep-weight-filter: true
  - all2all-backend: "flashinfer_nvlink_one_sided"
- nvidia-master.yaml entry for this config: decode ep: 1, dp-attn: false, which confirms no EP.
- Compare with sibling disagg-gb200-low-latency.yaml decode (lines 116-134): plain TP, EP commented out, neither EP-only flag set.
- Compare with sibling disagg-gb200-max-tpt-megamoe.yaml decode: enable-expert-parallel: true and enable-ep-weight-filter: true together.
- The 1p6d/1p17d decode is the only place in this directory where enable-ep-weight-filter and all2all-backend appear without enable-expert-parallel, breaking the established convention.
Impact
In the best case vLLM silently ignores both flags when EP is off, making them dead config that future readers will follow as a (wrong) template. In the worst case vLLM rejects the unsupported combination at startup, or enable-ep-weight-filter mis-prunes the MoE weights even though no EP all2all collective is being constructed. Either way these are tuning knobs in benchmark recipes that should match what they actually run.
Fix
Drop enable-ep-weight-filter and all2all-backend from the decode blocks of both disagg-gb300-1p6d-dep4-tp4.yaml and disagg-gb300-1p17d-tep4-tp4.yaml. Keep them in the prefill blocks of both files (1p6d prefill is DEP=4 with EP; 1p17d prefill is TEP=4 with EP — enable-ep-weight-filter legitimately belongs there).
```yaml
compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}'
gpu-memory-utilization: 0.9
```
🟡 The decode worker omits max-num-batched-tokens while every sibling DEP=8 decode in this PR (1p6d/1p17d/7p2d) and all GB200 megamoe siblings (max-tpt/high-tpt/mid-curve) set it to 512 alongside max-num-seqs: 512 and max-cudagraph-capture-size: 512. The same omission occurs in disagg-gb300-5p1d-dep4-dep8-28-c4096.yaml and disagg-gb300-6p1d-dep4-dep8-32-c4096.yaml — please add max-num-batched-tokens: 512 to the decode block in all three c4096 recipes to match the established pattern (without it, vLLM falls back to ≈max(2048, max_model_len)=16384, which is mismatched with the FULL_DECODE_ONLY cudagraph capture cap of 512).
Extended reasoning...
What is the bug?
In the three new GB300 max-throughput recipes added by this PR — disagg-gb300-4p1d-dep4-dep8-24-c4096.yaml, disagg-gb300-5p1d-dep4-dep8-28-c4096.yaml, and disagg-gb300-6p1d-dep4-dep8-32-c4096.yaml — the decode block sets max-num-seqs: 512 and max-cudagraph-capture-size: 512 but omits max-num-batched-tokens. Every other sibling recipe sets it explicitly to 512.
Specific code path
In disagg-gb300-4p1d-dep4-dep8-24-c4096.yaml (lines 106–107 region), the decode worker has:
```yaml
decode:
  ...
  max-model-len: 16384
  max-num-seqs: 512
  max-cudagraph-capture-size: 512
  trust-remote-code: true
  ...
  compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}'
```

Compare to the sibling disagg-gb300-1p6d-dep4-tp4.yaml decode block, which has max-num-batched-tokens: 512 between max-cudagraph-capture-size: 512 and trust-remote-code: true. The same line is present in disagg-gb300-1p17d-tep4-tp4.yaml, disagg-gb300-7p2d-dep4-dep16.yaml (this PR), and in every GB200 megamoe sibling (disagg-gb200-max-tpt-megamoe.yaml, disagg-gb200-high-tpt-megamoe.yaml, disagg-gb200-mid-curve-megamoe.yaml).
Why existing code doesn't prevent it
vLLM does not require max-num-batched-tokens to be set explicitly. When omitted, it defaults to max(2048, max_model_len), which is 16384 here — the same value used legitimately on the prefill worker. Nothing in the YAML schema or in vLLM's launch path will flag this as an inconsistency, so the misconfiguration silently propagates.
Impact
The decode worker captures CUDA graphs with FULL_DECODE_ONLY up to a batch size of 512 (max-cudagraph-capture-size: 512). When the scheduler is allowed to batch up to 16384 tokens but the cudagraph cap is 512, the scheduler can in principle dispatch decode steps that exceed the captured graph size, falling back to eager and degrading throughput at concurrency=4096. Practical impact is bounded because max-num-seqs=512 already caps decode batches in pure-decode disagg (one token per sequence per step), so the inconsistency is unlikely to actually trigger a fallback in steady state. Still, the deviation from the established pattern across all sibling recipes is a clear copy-paste oversight worth fixing for consistency and to avoid surprises if scheduling assumptions change.
How to fix
Add max-num-batched-tokens: 512 to the decode block in all three c4096 recipes, matching the placement used in sibling recipes:
```yaml
decode:
  ...
  max-num-seqs: 512
  max-cudagraph-capture-size: 512
  max-num-batched-tokens: 512  # <-- add this line
  trust-remote-code: true
  ...
```

Step-by-step proof
1. Open disagg-gb300-4p1d-dep4-dep8-24-c4096.yaml and look at the decode block. Keys present: max-model-len: 16384, max-num-seqs: 512, max-cudagraph-capture-size: 512. No max-num-batched-tokens.
2. Open disagg-gb300-1p6d-dep4-tp4.yaml decode block (added by the same PR). Keys present: max-model-len: 16384, max-num-seqs: 512, max-cudagraph-capture-size: 512, max-num-batched-tokens: 512.
3. Open disagg-gb300-7p2d-dep4-dep16.yaml decode block (added by the same PR, also a DEP decode with deep_gemm_mega_moe). Same DEP-decode pattern, with max-num-batched-tokens: 512 explicitly set.
4. Open the closest GB200 sibling, disagg-gb200-max-tpt-megamoe.yaml, decode block: also a max-throughput megamoe DEP=8 decode at high concurrency. max-num-batched-tokens: 512 is set alongside the same max-num-seqs: 512 / max-cudagraph-capture-size: 512 pair.
5. Repeat steps 1–4 for disagg-gb300-5p1d-dep4-dep8-28-c4096.yaml and disagg-gb300-6p1d-dep4-dep8-32-c4096.yaml. Both have decode blocks structurally identical to 4p1d and both omit max-num-batched-tokens.
6. Conclusion: only the three new c4096 max-throughput recipes deviate; the remaining 100% of sibling DEP-decode recipes set the field. The omission is a copy-paste oversight from the prefill block (which legitimately uses 16384).
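The per-file walk above can be condensed into a quick grep. A sketch, assuming the recipes/vllm/deepseek-v4/8k1k directory referenced earlier in this PR; note that grep reports per file, so any matches coming from a prefill block still need a manual look at which block they sit in:

```bash
#!/usr/bin/env bash
# List every max-num-batched-tokens occurrence in the GB300/GB200 8k1k recipes.
# The three new *-c4096.yaml files are expected to show no 512-valued hit for
# their decode block, while every sibling decode recipe should.
grep -n "max-num-batched-tokens" recipes/vllm/deepseek-v4/8k1k/disagg-gb*.yaml
```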