[NV] [DoNotMerge] Add DSV4-pro GB300 vLLM recipes #1238

Open

hjjq wants to merge 9 commits into main from dsv4-gb300-vllm
Conversation

@hjjq (Collaborator) commented Apr 30, 2026

No description provided.

@github-actions (Contributor) commented:

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Much of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines +7752 to +7758
dsv4-fp4-gb300-dynamo-vllm:
image: vllm/vllm-openai:v0.20.0-ubuntu2404
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: gb300-cw
precision: fp4
framework: dynamo-vllm

🔴 The new dsv4-fp4-gb300-dynamo-vllm config (.github/configs/nvidia-master.yaml:7752-7842) sets runner: gb300-cw + framework: dynamo-vllm, but runners/launch_gb300-cw.sh only accepts dynamo-sglang for dsv4/fp4 and exits 1 otherwise — every concurrency point in this PR will hard-fail at startup before any srtctl submission. The launcher needs a dynamo-vllm branch (mirroring the existing one in runners/launch_gb200-nv.sh:47 and :149) plus the recipe overlay logic for the new recipes/vllm/deepseek-v4/8k1k/disagg-gb300-*.yaml paths.

Extended reasoning...

What the bug is

The new dsv4-fp4-gb300-dynamo-vllm config added in this PR (.github/configs/nvidia-master.yaml:7752-7842) sets runner: gb300-cw and framework: dynamo-vllm. CI dispatches based on the runner stem — benchmark-multinode-tmpl.yml:177 runs bash ./runners/launch_${RUNNER_NAME%%_*}.sh, so any gb300-cw_N runner routes to runners/launch_gb300-cw.sh. That launcher's top gate is:

if [[ $FRAMEWORK == "dynamo-sglang" && $MODEL_PREFIX == "dsv4" && $PRECISION == "fp4" ]]; then
    export MODEL_PATH="/mnt/vast/models/dsv4/"
else
    echo "Unsupported model prefix/precision/framework combination on gb300-cw: $MODEL_PREFIX/$PRECISION/$FRAMEWORK. Currently supported: dsv4/fp4/dynamo-sglang"
    exit 1
fi

With FRAMEWORK=dynamo-vllm, the if-clause is false and the script exits 1 before any srtctl submission. The else branch even spells out the only supported combination as dsv4/fp4/dynamo-sglang.

Step-by-step proof

  1. The CI matrix resolves runner: gb300-cw to RUNNER_NAME=gb300-cw_N (e.g. gb300-cw_0).
  2. benchmark-multinode-tmpl.yml:177 invokes bash ./runners/launch_${RUNNER_NAME%%_*}.sh, which expands to bash ./runners/launch_gb300-cw.sh.
  3. Env: FRAMEWORK=dynamo-vllm, MODEL_PREFIX=dsv4, PRECISION=fp4.
  4. launch_gb300-cw.sh:10 evaluates [[ "dynamo-vllm" == "dynamo-sglang" && ... ]] → false.
  5. The else at lines 15-17 prints Unsupported model prefix/precision/framework combination on gb300-cw: dsv4/fp4/dynamo-vllm. Currently supported: dsv4/fp4/dynamo-sglang and runs exit 1.
  6. Every one of the 6 search-space entries (concurrencies 192/256, 18/36/72, 4096 ×3, 2048/3072) hits this same path and dies before srtctl is invoked.
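The stem extraction in step 2 can be reproduced directly with bash parameter expansion — `%%_*` removes everything from the first underscore onward, so any `gb300-cw_N` runner resolves to the same launcher (a minimal illustration; the variable value is just an example runner name):

```shell
# How benchmark-multinode-tmpl.yml derives the launcher path from the
# runner name: ${VAR%%_*} strips the longest suffix starting at the
# first underscore, leaving only the runner stem.
RUNNER_NAME="gb300-cw_0"
echo "${RUNNER_NAME%%_*}"                       # gb300-cw
echo "./runners/launch_${RUNNER_NAME%%_*}.sh"   # ./runners/launch_gb300-cw.sh
```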

Why nothing else saves it

  • The companion runners/launch_gb300-nv.sh only supports dsr1/fp4 and dsr1/fp8 (lines 13-24, with explicit Supported models are: dsr1-fp4, dsr1-fp8 error), so swapping runner: gb300 would also fail.
  • A grep for dynamo-vllm in launch_gb300-cw.sh returns only the comment block at lines 4-5 referencing PR "vLLM GB300 Day 0 DSV4 FP4 disagg" #1150 — there is no executable branch in this PR.
  • The recipe overlay block at launch_gb300-cw.sh:113-114 only copies recipes/sglang/deepseek-v4. Even if the framework gate were patched, the new config's CONFIG_FILE=recipes/vllm/deepseek-v4/8k1k/disagg-gb300-*.yaml paths would not exist in the upstream srt-slurm checkout, so srtctl apply would fail at YAML resolution.

Fix

Mirror the two-branch pattern already present in runners/launch_gb200-nv.sh:

  1. Add a dynamo-vllm + dsv4 + fp4 branch in the top gating block to set MODEL_PATH / SRT_SLURM_MODEL_PREFIX (analogous to launch_gb200-nv.sh:47-60).
  2. Add a parallel branch in the recipe overlay block that clones the right NVIDIA/srt-slurm ref and overlays benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4 into recipes/vllm/deepseek-v4 (analogous to launch_gb200-nv.sh:149-158).
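A hypothetical sketch of what the patched gate (item 1) could look like, written as a function for illustration. It mirrors the two-branch pattern described above; the dynamo-vllm MODEL_PATH is an assumption (it reuses the existing dynamo-sglang location), and the real fix must also keep the recipe overlay from item 2:

```shell
# Hypothetical sketch of the patched gate in runners/launch_gb300-cw.sh.
# Assumption: dynamo-vllm loads the model from the same path as the
# existing dynamo-sglang branch.
gate_gb300_cw() {
  local framework="$1" model_prefix="$2" precision="$3"
  if [[ $framework == "dynamo-sglang" && $model_prefix == "dsv4" && $precision == "fp4" ]]; then
      export MODEL_PATH="/mnt/vast/models/dsv4/"
  elif [[ $framework == "dynamo-vllm" && $model_prefix == "dsv4" && $precision == "fp4" ]]; then
      export MODEL_PATH="/mnt/vast/models/dsv4/"
  else
      echo "Unsupported combination on gb300-cw: $model_prefix/$precision/$framework. Currently supported: dsv4/fp4/dynamo-sglang, dsv4/fp4/dynamo-vllm" >&2
      return 1
  fi
}

gate_gb300_cw dynamo-vllm dsv4 fp4 && echo "MODEL_PATH=$MODEL_PATH"
# prints MODEL_PATH=/mnt/vast/models/dsv4/
```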

Until both pieces land, the entire newly-added benchmark cannot run.

Comment on lines +99 to +100
stream-interval: 50
no-disable-hybrid-kv-cache-manager: true

🟡 The decode workers in disagg-gb300-1p6d-dep4-tp4.yaml (lines ~99-100) and disagg-gb300-1p17d-tep4-tp4.yaml (lines ~91-92) are plain TP=4 (no data-parallel-size, no enable-expert-parallel), yet still set the EP-only flags enable-ep-weight-filter: true and all2all-backend: flashinfer_nvlink_one_sided. This is inconsistent with every sibling recipe in this directory — every other 8k1k decode that sets these flags also sets enable-expert-parallel: true (see GB200 mid/high/max-tpt-megamoe and the new 7p2d). Looks like a copy-paste from the prefill block; please drop both flags from the decode in 1p6d and 1p17d (the prefill in 1p6d/1p17d should keep them since they are DEP=4 / TEP=4 with EP enabled).

Extended reasoning...

What the bug is

In disagg-gb300-1p6d-dep4-tp4.yaml and disagg-gb300-1p17d-tep4-tp4.yaml, the decode worker is plain TP=4: it sets tensor-parallel-size: 4 with no data-parallel-size and no enable-expert-parallel. The matching nvidia-master.yaml entries also describe these decode workers as ep: 1, dp-attn: false. But each decode block also sets two expert-parallelism-only knobs:

      enable-ep-weight-filter: true
      all2all-backend: "flashinfer_nvlink_one_sided"

enable-ep-weight-filter only has meaning when the model is sharded across EP ranks (it filters the per-expert weights at load time according to the EP assignment). all2all-backend only has meaning when an EP all2all collective is being constructed. With EP off, neither flag has any input to act on.

Why it is wrong

Every other 8k1k recipe in this directory follows a strict pattern: these two flags are set only on workers that also set enable-expert-parallel: true.

  • GB200 plain-TP decode (no EP): disagg-gb200-low-latency.yaml (decode lines 116-134) and disagg-gb200-low-middle-curve.yaml (decode lines 118-136) explicitly comment OUT data-parallel-size and enable-expert-parallel, and do not set enable-ep-weight-filter or all2all-backend.
  • GB200 EP decode: disagg-gb200-mid-curve-megamoe.yaml, high-tpt-megamoe.yaml, max-tpt-megamoe.yaml all set enable-expert-parallel: true together with enable-ep-weight-filter: true.
  • GB300 7p2d decode (this same PR): genuinely DEP=16 (data-parallel-size: 16 + enable-expert-parallel: true), so the flags belong there.
  • GB300 4p1d/5p1d/6p1d c4096 decodes (this same PR): DEP=8 with enable-expert-parallel: true, so they correctly set enable-ep-weight-filter: true (no all2all-backend).

Only the 1p6d and 1p17d decodes break the pattern. The prefill blocks in these same files are DEP=4 (1p6d) and TEP=4 (1p17d) and do correctly enable EP — these flags clearly belong to the prefill, and the decode block reads like a copy-paste from prefill that forgot to drop the EP-only knobs.

Step-by-step proof (1p6d)

  1. disagg-gb300-1p6d-dep4-tp4.yaml decode block (around lines 84-105):
    • tensor-parallel-size: 4
    • no data-parallel-size
    • no enable-expert-parallel
    • enable-ep-weight-filter: true
    • all2all-backend: "flashinfer_nvlink_one_sided"
  2. nvidia-master.yaml entry for this config: decode: ep: 1, dp-attn: false — confirms no EP.
  3. Compare with sibling disagg-gb200-low-latency.yaml decode (lines 116-134): plain TP, EP commented out, neither EP-only flag set.
  4. Compare with sibling disagg-gb200-max-tpt-megamoe.yaml decode: enable-expert-parallel: true + enable-ep-weight-filter: true together.
  5. The 1p6d/1p17d decode is the only place in this directory where enable-ep-weight-filter and all2all-backend appear without enable-expert-parallel, which breaks the established convention.

Impact

In the best case vLLM silently ignores both flags when EP is off, making them dead config that future readers will follow as a (wrong) template. In the worst case vLLM rejects the unsupported combination at startup, or enable-ep-weight-filter mis-prunes the MoE weights even though no EP all2all collective is being constructed. Either way these are tuning knobs in benchmark recipes that should match what they actually run.

Fix

Drop enable-ep-weight-filter and all2all-backend from the decode blocks of both disagg-gb300-1p6d-dep4-tp4.yaml and disagg-gb300-1p17d-tep4-tp4.yaml. Keep them in the prefill blocks of both files (1p6d prefill is DEP=4 with EP; 1p17d prefill is TEP=4 with EP — enable-ep-weight-filter legitimately belongs there).
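To keep this from regressing, the convention can be spot-checked mechanically. The sketch below is a hypothetical per-worker audit, not part of the PR: it assumes top-level, unindented `prefill:` / `decode:` keys (as in the excerpts quoted above) and flags any section that sets the EP-only knobs without `enable-expert-parallel: true`:

```shell
# Hypothetical audit: report prefill/decode sections that set EP-only
# flags without enabling expert parallelism. Assumes unindented
# "prefill:" / "decode:" top-level keys, per the excerpts above.
check_recipe() {
  awk '
    /^(prefill|decode):/ { flush(); section = $1 }
    /enable-expert-parallel: true/ { ep = 1 }
    /enable-ep-weight-filter|all2all-backend/ { flag = 1 }
    END { flush() }
    function flush() {
      if (section != "" && flag && !ep)
        printf "EP-only flags without EP in %s\n", section
      ep = 0; flag = 0
    }
  ' "$1"
}
# usage: check_recipe recipes/vllm/deepseek-v4/8k1k/disagg-gb300-1p6d-dep4-tp4.yaml
```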

Comment on lines +106 to +107
compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}'
gpu-memory-utilization: 0.9

🟡 The decode worker omits max-num-batched-tokens while every sibling DEP=8 decode in this PR (1p6d/1p17d/7p2d) and all GB200 megamoe siblings (max-tpt/high-tpt/mid-curve) set it to 512 alongside max-num-seqs: 512 and max-cudagraph-capture-size: 512. The same omission occurs in disagg-gb300-5p1d-dep4-dep8-28-c4096.yaml and disagg-gb300-6p1d-dep4-dep8-32-c4096.yaml — please add max-num-batched-tokens: 512 to the decode block in all three c4096 recipes to match the established pattern (without it, vLLM falls back to ≈max(2048, max_model_len)=16384, which is mismatched with the FULL_DECODE_ONLY cudagraph capture cap of 512).

Extended reasoning...

What is the bug?

In the three new GB300 max-throughput recipes added by this PR — disagg-gb300-4p1d-dep4-dep8-24-c4096.yaml, disagg-gb300-5p1d-dep4-dep8-28-c4096.yaml, and disagg-gb300-6p1d-dep4-dep8-32-c4096.yaml — the decode block sets max-num-seqs: 512 and max-cudagraph-capture-size: 512 but omits max-num-batched-tokens. Every other sibling recipe sets it explicitly to 512.

Specific code path

In disagg-gb300-4p1d-dep4-dep8-24-c4096.yaml (lines 106–107 region), the decode worker has:

decode:
  ...
  max-model-len: 16384
  max-num-seqs: 512
  max-cudagraph-capture-size: 512
  trust-remote-code: true
  ...
  compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}'

Compare to the sibling disagg-gb300-1p6d-dep4-tp4.yaml decode block, which has max-num-batched-tokens: 512 between max-cudagraph-capture-size: 512 and trust-remote-code: true. The same line is present in disagg-gb300-1p17d-tep4-tp4.yaml, disagg-gb300-7p2d-dep4-dep16.yaml (this PR), and in every GB200 megamoe sibling (disagg-gb200-max-tpt-megamoe.yaml, disagg-gb200-high-tpt-megamoe.yaml, disagg-gb200-mid-curve-megamoe.yaml).

Why existing code doesn't prevent it

vLLM does not require max-num-batched-tokens to be set explicitly. When omitted, it defaults to max(2048, max_model_len), which is 16384 here — the same value used legitimately on the prefill worker. Nothing in the YAML schema or in vLLM's launch path will flag this as an inconsistency, so the misconfiguration silently propagates.
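The fallback arithmetic can be made concrete (a sketch of the default stated above, assuming the max(2048, max_model_len) rule as described in this comment):

```shell
# Fallback described above: when max-num-batched-tokens is omitted,
# vLLM defaults it to max(2048, max_model_len).
max_model_len=16384
max_num_batched_tokens=$(( max_model_len > 2048 ? max_model_len : 2048 ))
echo "$max_num_batched_tokens"   # 16384 — 32x the cudagraph capture cap of 512
```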

Impact

The decode worker captures CUDA graphs with FULL_DECODE_ONLY up to a batch size of 512 (max-cudagraph-capture-size: 512). When the scheduler is allowed to batch up to 16384 tokens but the cudagraph cap is 512, the scheduler can in principle dispatch decode steps that exceed the captured graph size, falling back to eager and degrading throughput at concurrency=4096. Practical impact is bounded because max-num-seqs=512 already caps decode batches in pure-decode disagg (one token per sequence per step), so the inconsistency is unlikely to actually trigger a fallback in steady state. Still, the deviation from the established pattern across all sibling recipes is a clear copy-paste oversight worth fixing for consistency and to avoid surprises if scheduling assumptions change.

How to fix

Add max-num-batched-tokens: 512 to the decode block in all three c4096 recipes, matching the placement used in sibling recipes:

decode:
  ...
  max-num-seqs: 512
  max-cudagraph-capture-size: 512
  max-num-batched-tokens: 512   # <-- add this line
  trust-remote-code: true
  ...

Step-by-step proof

  1. Open disagg-gb300-4p1d-dep4-dep8-24-c4096.yaml decode block. Keys present: max-model-len: 16384, max-num-seqs: 512, max-cudagraph-capture-size: 512. No max-num-batched-tokens.
  2. Open disagg-gb300-1p6d-dep4-tp4.yaml decode block (added by the same PR). Keys present: max-model-len: 16384, max-num-seqs: 512, max-cudagraph-capture-size: 512, max-num-batched-tokens: 512.
  3. Open disagg-gb300-7p2d-dep4-dep16.yaml decode block (added by the same PR, also DEP-decode + deep_gemm_mega_moe). Same DEP-decode pattern with max-num-batched-tokens: 512 explicitly set.
  4. Open the closest GB200 sibling disagg-gb200-max-tpt-megamoe.yaml decode block — also a max-throughput megamoe DEP=8 decode at high concurrency. max-num-batched-tokens: 512 is set alongside the same max-num-seqs=512/max-cudagraph-capture-size=512 pair.
  5. Repeat steps 1–4 for disagg-gb300-5p1d-dep4-dep8-28-c4096.yaml and disagg-gb300-6p1d-dep4-dep8-32-c4096.yaml. Both have decode blocks structurally identical to 4p1d and both omit max-num-batched-tokens.
  6. Conclusion: only the three new c4096 max-throughput recipes deviate; the remaining 100% of sibling DEP-decode recipes set the field. The omission is a copy-paste oversight from the prefill block (which legitimately uses 16384).

@hjjq hjjq changed the title [NV] Add DSV4-pro GB300 vLLM recipes [NV] [DoNotMerge] Add DSV4-pro GB300 vLLM recipes Apr 30, 2026
