Update dsv4 b300 configs #1155
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Claude finished @wzhao18's task in 1m 29s

Re-review of PR #1155. Previous 🔴 BLOCKING issue (missing …): changes reviewed.

Still open from the previous review: the 🟡 WARNING about header comments in both the YAML (line 2466) and the script (line 4) still describing … No new blocking issues found. LGTM pending the optional comment update.
```yaml
- config-keys:
    - dsv4-fp4-b300-vllm
  description:
    - "Update search space based on B300 pareto sweep results"
    - "ISL=1024: TP4 conc 4-128; DP4 (dp-attn) conc 256-4096; DP8 (dp-attn) conc 2048-8192"
    - "ISL=8192: TP4 conc 4-64; DP4 (dp-attn) conc 128-1024; DP8 (dp-attn) conc 1024-8192"
```
🟡 The new perf-changelog entry at perf-changelog.yaml:1822-1830 documents only the search-space reshape, but the diff also rewrites `--max-num-batched-tokens` in `benchmarks/single_node/dsv4_fp4_b300_vllm.sh` from a constant 2048 to an ISL/DP-conditional formula (DP: max(ISL, 2048); TP: max(2*ISL, 2048)). PR #1144's prior entry explicitly recorded "max-num-batched-tokens 2048" as part of the launch-args summary, so omitting the rewrite leaves that prior statement stale and the new formula invisible in the changelog audit trail. Suggest appending a description line such as: "Set --max-num-batched-tokens = max(ISL, 2048) for DP and max(2*ISL, 2048) for TP, replacing the previous constant 2048".
Extended reasoning...
What the bug is
This PR makes two substantive changes to dsv4-fp4-b300-vllm, but the new perf-changelog.yaml entry (lines 1822-1830) only describes one of them:
- Documented: search-space reshape (TP4/DP4/DP8 concurrency ranges for ISL=1024 and ISL=8192).
- Undocumented: in `benchmarks/single_node/dsv4_fp4_b300_vllm.sh:41-46, 81`, `--max-num-batched-tokens` is changed from a constant `2048` to a function of ISL and DP_ATTENTION (sketched below):
  - DP mode: `max(ISL, 2048)`
  - TP mode: `max(2*ISL, 2048)`
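As a minimal bash sketch of that conditional (the variable names `ISL` and `DP_ATTENTION` come from this comment; the exact layout in the script is an assumption, not the repo's verbatim code):

```bash
# Sketch of the reworked --max-num-batched-tokens selection; layout and
# variable names are assumed from the description above, not copied from
# the actual dsv4_fp4_b300_vllm.sh.
if [ "$DP_ATTENTION" = "true" ]; then
  # DP mode: max(ISL, 2048)
  MAX_NUM_BATCHED_TOKENS=$(( ISL > 2048 ? ISL : 2048 ))
else
  # TP mode: max(2*ISL, 2048)
  MAX_NUM_BATCHED_TOKENS=$(( 2 * ISL > 2048 ? 2 * ISL : 2048 ))
fi
```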
Why this matters for the audit trail
The predecessor entry (PR #1144, the PR that introduced this config) explicitly records "max-num-batched-tokens 2048" as part of the launch-arg summary. After this PR that statement is stale, but no follow-up entry replaces it. AGENTS.md describes perf-changelog.yaml as the chronological record of changes that affect benchmarks; the max-num-batched-tokens rewrite materially affects request-batching behavior, so it belongs in the description.
Step-by-step proof of impact
For an ISL=8192 run in TP mode under the new formula:
- `DP_ATTENTION != "true"`, so the TP branch is taken.
- `MAX_NUM_BATCHED_TOKENS = 2 * 8192 = 16384` (the floor at 2048 is not binding).
- `vllm serve --max-num-batched-tokens 16384`, 8x the previous constant 2048.

For ISL=8192 in DP mode (`{ tp: 4|8, dp-attn: true }`):
- The DP branch is taken.
- `MAX_NUM_BATCHED_TOKENS = 8192`, 4x the previous 2048.
These are not cosmetic deltas: they reshape the prefill-batching capacity and will alter throughput / TTFT characteristics relative to anything benchmarked under PR #1144.
Addressing the refutations
One refuter argued this is subjective wording and that perf-changelog descriptions vary in detail (some are one-liners). That is true in general, but here the new entry is specifically a successor to PR #1144's entry for the same config-keys: [dsv4-fp4-b300-vllm], which deliberately enumerated the launch args including max-num-batched-tokens 2048. Following that precedent, the value's replacement deserves a line. A second refuter called this a duplicate of bug_003 — the synthesis agent has merged them; this single comment now stands for both.
Severity and fix
This is a documentation-completeness issue with no runtime impact, so it is filed at nit severity (matching all three independent confirmations). Fix: append one description line to the new entry, e.g.
Set --max-num-batched-tokens = max(ISL, 2048) for DP and max(2*ISL, 2048) for TP, replacing the previous constant 2048.
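For illustration, the amended changelog entry could then read as follows (a sketch that reuses the entry structure from the diff above; only the last description line is new):

```yaml
- config-keys:
    - dsv4-fp4-b300-vllm
  description:
    - "Update search space based on B300 pareto sweep results"
    - "ISL=1024: TP4 conc 4-128; DP4 (dp-attn) conc 256-4096; DP8 (dp-attn) conc 2048-8192"
    - "ISL=8192: TP4 conc 4-64; DP4 (dp-attn) conc 128-1024; DP8 (dp-attn) conc 1024-8192"
    - "Set --max-num-batched-tokens = max(ISL, 2048) for DP and max(2*ISL, 2048) for TP, replacing the previous constant 2048"
```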
```yaml
      - { tp: 8, dp-attn: true, conc-start: 2048, conc-end: 8192 }
  - isl: 8192
    osl: 1024
    search-space:
      - { tp: 8, conc-start: 4, conc-end: 4 }
      - { tp: 4, conc-start: 4, conc-end: 128 }
      - { tp: 8, conc-start: 128, conc-end: 128 }
      - { tp: 4, dp-attn: true, conc-start: 256, conc-end: 512 }
      - { tp: 4, conc-start: 4, conc-end: 64 }
      - { tp: 4, dp-attn: true, conc-start: 128, conc-end: 1024 }
      - { tp: 8, dp-attn: true, conc-start: 1024, conc-end: 8192 }
```
🟡 The two new TP=8/dp-attn:true rows added to dsv4-fp4-b300-vllm (lines 2482 and 2488) omit the ep field, so generate_sweep_configs.py defaults the metadata to ep=1. But dsv4_fp4_b300_vllm.sh unconditionally passes --enable-expert-parallel and sets --data-parallel-size 8 for these rows, so the actual run is EP=8 — the result-filename template (tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}) and downstream group-by tooling will tag these B300 rows as ep=1 while the underlying run is EP=8. Sister config dsv4-fp8-h200-vllm at line 2458/2462 explicitly tags the analogous TP=8/dp-attn:true row as { tp: 8, ep: 8, dp-attn: true, ... }. Suggest adding explicit ep: 8 to both new TP=8 entries to match the convention. (Note: the existing TP=4 dp-attn:true rows on this same config also omit ep, but that pattern was inherited from PR #1144 — this PR extends the issue to TP=8.)
Extended reasoning...
What the bug is
The two new search-space rows added to dsv4-fp4-b300-vllm omit the ep field:
```yaml
# .github/configs/nvidia-master.yaml line 2482
- { tp: 8, dp-attn: true, conc-start: 2048, conc-end: 8192 }
# line 2488
- { tp: 8, dp-attn: true, conc-start: 1024, conc-end: 8192 }
```

In `utils/matrix_logic/generate_sweep_configs.py:354`, the matrix-entry default for `ep` is 1 (set unconditionally), and lines 362-363 only override `ep` if the YAML key was present (`bmk.get(Fields.EP.value)` returns `None` when omitted). So these matrix entries are tagged with `ep=1` in the generated metadata.
Why the actual runtime is EP=8
`benchmarks/single_node/dsv4_fp4_b300_vllm.sh` does not consult the metadata `ep` value at all. At line 38 the parallel block sets `--data-parallel-size "$TP"` (i.e. `--data-parallel-size 8` when TP=8 and `DP_ATTENTION=true`), and at line 78 it unconditionally passes `--enable-expert-parallel`. Under vLLM, `--enable-expert-parallel` with `--data-parallel-size 8` runs with an effective expert-parallel world size of 8 (each rank holds 1/8 of the experts). So the runtime is EP=8 while the metadata says EP=1.
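A hedged sketch of the flag wiring just described (branch structure, `$MODEL`, and the non-dp-attn branch are assumptions inferred from the quoted lines, not the script verbatim):

```bash
# Sketch of the parallelism flags described above; only the flags quoted
# in this comment are shown, everything else is omitted.
if [ "$DP_ATTENTION" = "true" ]; then
  PARALLEL_ARGS="--tensor-parallel-size 1 --data-parallel-size $TP"
else
  PARALLEL_ARGS="--tensor-parallel-size $TP"   # assumed TP-mode branch
fi
# --enable-expert-parallel is passed unconditionally, so with
# --data-parallel-size 8 the effective expert-parallel world size is 8.
vllm serve "$MODEL" $PARALLEL_ARGS --enable-expert-parallel
```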
Why this is a metadata mismatch worth flagging
The sister config dsv4-fp8-h200-vllm at lines 2458 and 2462 explicitly tags the analogous TP=8/dp-attn:true rows as { tp: 8, ep: 8, dp-attn: true, ... } — confirming that ep: 8 is the established convention for this scenario across the dsv4 family. The metadata flows into RESULT_FILENAME (via .github/workflows/benchmark-tmpl.yml:146, template tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}), so the new B300 rows will be saved as tp8-ep1-dpaTrue while the actual run is EP=8. Downstream group-by tooling (compare_results.py, summarize.py, collect_eval_results.py) keys on ep, so cross-config analysis across the dsv4 family will misclassify these B300 rows.
Step-by-step proof
1. The harness picks the new YAML row `{ tp: 8, dp-attn: true, conc-start: 2048, conc-end: 8192 }` (line 2482). `generate_sweep_configs.py:354` assigns `ep=1` (default). Line 362 checks `bmk.get('ep')`, which is `None`, so the default is kept. Metadata: `tp=8, ep=1, dp-attn=true`.
2. The workflow exports `EP_SIZE=1`, `TP=8`, `DP_ATTENTION=true`. The result-filename template at `benchmark-tmpl.yml:146` resolves to `tp8-ep1-dpaTrue-...`.
3. The script `dsv4_fp4_b300_vllm.sh` runs with `--tensor-parallel-size 1 --data-parallel-size 8 --enable-expert-parallel`. vLLM's effective EP world size is 8.
4. The result file is tagged `ep1` while the run is EP=8. The h200 sister config tags the same scenario `ep=8`.
Impact and fix
Metadata-only — runtime behavior is correct because the script hardcodes parallelism. Severity is nit. Fix: add ep: 8 to both new TP=8 entries (lines 2482 and 2488) to match dsv4-fp8-h200-vllm. Ideally ep: 4 would also be added to the kept TP=4/dp-attn:true rows for full consistency, but that pattern was inherited from PR #1144 and is outside the scope of this PR.
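Concretely, the two rows with the suggested fix applied would look like this (based on the rows quoted above):

```yaml
# .github/configs/nvidia-master.yaml, with explicit ep tags matching
# the dsv4-fp8-h200-vllm convention
- { tp: 8, ep: 8, dp-attn: true, conc-start: 2048, conc-end: 8192 }   # line 2482
- { tp: 8, ep: 8, dp-attn: true, conc-start: 1024, conc-end: 8192 }   # line 2488
```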
@kedarpotdar-nv PR ready for review. Previous run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24935460573?pr=1155. One job failed due to timeout; I have removed it. The rest looks fine.
Removed several benchmark configurations and updated the search space for dsv4-fp4-b300-vllm based on recent results.
@functionstackx Could you review this PR? Thank you. All changes are ready except needing to resolve … Passing sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24948219425