[NV] Add deepseek-v4-pro b300 vllm config#1144
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review and get a PR approval from the respective company's CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
@claude please add description |
Claude finished @kedarpotdar-nv's task in 52s — Adding PR Description
PR description has been added.
```yaml
    search-space:
      - { tp: 8, conc-start: 4, conc-end: 4 }
      - { tp: 4, conc-start: 4, conc-end: 128 }
      - { tp: 8, conc-start: 128, conc-end: 128 }
      - { tp: 4, dp-attn: true, conc-start: 256, conc-end: 512 }
  - isl: 8192
    osl: 1024
    search-space:
      - { tp: 8, conc-start: 4, conc-end: 4 }
      - { tp: 4, conc-start: 4, conc-end: 128 }
      - { tp: 8, conc-start: 128, conc-end: 128 }
      - { tp: 4, dp-attn: true, conc-start: 256, conc-end: 512 }
```
🔴 All 4 search-space entries for dsv4-fp8-b300-vllm (nvidia-master.yaml:2402-2413) omit the ep field, so generate_sweep_configs.py defaults each matrix entry to ep=1. But benchmarks/single_node/dsv4_fp8_b300.sh always passes --enable-expert-parallel, meaning the actual EP is 8 (for tp:8), 4 (for tp:4), or 4 (for tp:4/dp-attn:true) — never 1. Downstream metadata (RESULT_FILENAME, process_result.py, compare_results.py/summarize.py grouping keys) will therefore record ep=1 for every data point. Fix by adding ep: 8 to the two tp:8 entries and ep: 4 to the two tp:4 entries, mirroring the adjacent dsv4-fp8-h200-vllm config and PR #919's metadata cleanup.
Extended reasoning...
What the bug is. The newly added dsv4-fp8-b300-vllm block (.github/configs/nvidia-master.yaml:2388-2413) declares four search-space entries across its two seq-len configs and none of them set the ep field: {tp:8,...}, {tp:4,...}, {tp:8,...}, {tp:4,dp-attn:true,...}. In contrast, the sibling dsv4-fp8-h200-vllm at line 2385 correctly specifies ep: 8, which is the established convention for MoE configs in this file.
Why the default is wrong for this recipe. utils/matrix_logic/generate_sweep_configs.py:354 initializes Fields.EP.value to 1 for single-node entries and only overrides it (lines 362-363) when ep is explicitly present in the YAML entry. So every generated matrix row for this config gets ep=1. However, benchmarks/single_node/dsv4_fp8_b300.sh unconditionally passes --enable-expert-parallel on the vllm serve command (line ~76 of the new script), independent of TP or DP_ATTENTION. With vLLM's expert-parallel semantics, the effective expert-parallel degree equals the world size (TP × DP), so the runtime EP is 8 or 4, never 1.
How the metadata mismatch propagates. The EP value from the matrix becomes EP_SIZE via .github/workflows/benchmark-tmpl.yml:85, and that value is then (a) embedded in RESULT_FILENAME at line 146 as ep${EP_SIZE}, (b) written into the aggregated JSON by utils/process_result.py:100-108 as data['ep'] = ep_size, (c) used as a grouping key in utils/summarize.py:82,104, and (d) forms the tp{tp}/ep{ep} lookup key in utils/compare_results.py:244. So every single B300 result file for this PR will be named ...ep1... and every aggregated data point will claim ep: 1, while the actual run executed with EP=4 or EP=8. Any downstream baseline comparison or eval grouping will key on a value that doesn't exist in the launched recipe.
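The propagation described above can be sketched as follows. This is a minimal illustration, not the repo's code: the filename template and field names are assumptions inferred from the `..._tp4-ep1-dpaFalse_...` example in this thread.

```python
# Hypothetical sketch of how the matrix ep value flows into the result
# filename and the aggregated JSON grouping keys. Names and the filename
# template are assumptions, not copied from the repo.

def result_metadata(tp: int, ep: int, dp_attn: bool) -> tuple[str, dict]:
    # benchmark-tmpl.yml exports EP_SIZE from the matrix; the filename and
    # the aggregated JSON both record that value, whatever it is.
    filename = f"tp{tp}-ep{ep}-dpa{dp_attn}"
    data = {"tp": tp, "ep": ep, "dp_attn": dp_attn}  # grouping keys
    return filename, data

# A search-space entry without an explicit ep gets the default ep=1:
name, data = result_metadata(tp=4, ep=1, dp_attn=False)
print(name)        # tp4-ep1-dpaFalse
print(data["ep"])  # 1, even though the launched recipe ran with EP=4
```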
Step-by-step proof for the second entry (tp:4, conc 4-128 on 1k1k):

- YAML entry: `{ tp: 4, conc-start: 4, conc-end: 128 }` (no `ep` key).
- generate_sweep_configs.py:354 seeds the row with `ep: 1` (the default) and the tp override sets `tp: 4`; lines 362-363 do not run because `'ep'` is not in the dict.
- The matrix row is emitted with `tp=4, ep=1, dp-attn=false`.
- benchmark-tmpl.yml:85 exports `EP_SIZE=1`; line 146 stamps the result file as `..._tp4-ep1-dpaFalse_...`.
- The launch script enters the else-branch (DP_ATTENTION != true), so PARALLEL_ARGS is `--tensor-parallel-size 4 --data-parallel-size 1`, and `--enable-expert-parallel` is always present → vLLM runs with TP=4, DP=1, EP enabled over world size 4 → effective EP=4.
- process_result.py reads `EP_SIZE=1` from the environment and writes `{'ep': 1, ...}` to the JSON: the recorded `ep` field is 1, while the actual EP used was 4.
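The defaulting step at the heart of this proof can be sketched in a few lines. This is an illustration of the behavior described above, assuming a simplified entry shape; `build_matrix_row` and its field names are hypothetical, not the actual function in generate_sweep_configs.py.

```python
# Hypothetical sketch of the ep defaulting behavior attributed to
# generate_sweep_configs.py: each matrix row seeds ep=1 and only an
# explicit 'ep' key in the YAML entry overrides it.

def build_matrix_row(entry: dict) -> dict:
    row = {"tp": 1, "ep": 1, "dp_attn": False}  # ep defaults to 1
    row["tp"] = entry.get("tp", row["tp"])
    if "ep" in entry:  # only an explicit key overrides the default
        row["ep"] = entry["ep"]
    row["dp_attn"] = entry.get("dp-attn", False)
    return row

# The second B300 entry omits 'ep', so the matrix records ep=1 even
# though the launch script always enables expert parallelism:
row = build_matrix_row({"tp": 4, "conc-start": 4, "conc-end": 128})
print(row)  # {'tp': 4, 'ep': 1, 'dp_attn': False}
```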
Why this was not caught earlier. There is no validation that cross-references --enable-expert-parallel in a launch script against the ep field in matrix entries; the coupling is by convention. This is precisely the class of mismatch that PR #919 ('Fix metadata inconsistencies in nvidia-master.yaml - TP/EP/DP-attn values now match actual recipe files') was created to clean up, and that the gptoss-fp4-* and dsr1-fp4-* changelogs repeatedly reference ('Explicitly add EP=TP for DP attention configs', 'Set ep:4 for all tp:4 entries, ep:8 for all tp:8 entries').
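A lint of the kind missing here could be sketched as follows. This checker does not exist in the repo; it is a hypothetical example of cross-referencing a launch script's flags against the YAML search-space entries.

```python
# Hypothetical validation (not part of the repo) for the mismatch class
# described above: if a config's launch script passes
# --enable-expert-parallel, every search-space entry must set ep explicitly.

def check_ep_metadata(script_text: str, search_space: list[dict]) -> list[str]:
    errors = []
    if "--enable-expert-parallel" in script_text:
        for i, entry in enumerate(search_space):
            if "ep" not in entry:
                errors.append(f"entry {i}: script enables EP but 'ep' is unset")
    return errors

search_space = [
    {"tp": 8, "conc-start": 4, "conc-end": 4},           # missing ep -> flagged
    {"tp": 4, "ep": 4, "conc-start": 4, "conc-end": 128},  # explicit ep -> ok
]
print(check_ep_metadata("vllm serve ... --enable-expert-parallel", search_space))
```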
Fix. Add explicit `ep` to each B300 search-space entry to match the launched EP:

- `{ tp: 8, ep: 8, conc-start: 4, conc-end: 4 }`
- `{ tp: 4, ep: 4, conc-start: 4, conc-end: 128 }`
- `{ tp: 8, ep: 8, conc-start: 128, conc-end: 128 }`
- `{ tp: 4, ep: 4, dp-attn: true, conc-start: 256, conc-end: 512 }`
This mirrors the adjacent dsv4-fp8-h200-vllm convention (ep: 8 for tp: 8, dp-attn: true) and keeps RESULT_FILENAME, process_result.py, and compare_results.py in sync with the actual runtime EP. This is a metadata-only change; no recipe-file changes are required.
Force-pushed from ce343aa to 0e86648
Use _${FRAMEWORK} suffix for dsv4 model on B300 so vllm and sglang
configs resolve to distinct benchmark scripts. Other models keep the
existing _trt-only logic. Rename dsv4_fp4_b300.sh to dsv4_fp4_b300_sglang.sh
to match the new convention.
Force-pushed from 0e86648 to 6a0fa73
# Conflicts:
#   benchmarks/single_node/dsv4_fp4_b300_vllm.sh
#   perf-changelog.yaml
#   runners/launch_b300-nv.sh
functionstackx left a comment:
lgtm, resolve conflicts and merge tonight plz
great work @Ankur-singh
Add new B300 vLLM config `dsv4-fp4-b300-vllm` for DeepSeek-V4-Pro single-node benchmarks.