
Update dsv4 b300 configs #1155

Merged: Oseltamivir merged 11 commits into main from nv/dsv4-b300-agg on Apr 26, 2026
Conversation

@wzhao18
Collaborator

@wzhao18 wzhao18 commented Apr 25, 2026

Update DSv4 B300 configs

@wzhao18 wzhao18 requested a review from a team April 25, 2026 15:10
@wzhao18 wzhao18 requested a review from kedarpotdar-nv as a code owner April 25, 2026 15:10
@wzhao18 wzhao18 requested a review from jgangani as a code owner April 25, 2026 15:10
@wzhao18 wzhao18 marked this pull request as draft April 25, 2026 15:10
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Much of the time, failures are just flakes, and simply re-running the failed jobs will fix them; if failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment

@wzhao18 wzhao18 changed the title Update dsv4 b300 configs [WIP] Update dsv4 b300 configs Apr 25, 2026
Comment thread .github/configs/nvidia-master.yaml
Comment thread .github/configs/nvidia-master.yaml Outdated
@wzhao18 wzhao18 marked this pull request as ready for review April 25, 2026 15:41
@claude
Contributor

claude Bot commented Apr 25, 2026

Claude finished @wzhao18's task in 1m 29s


Re-review of PR #1155

Previous 🔴 BLOCKING issue (missing perf-changelog.yaml entry) has been fixed in commit 9043b4ab. The new entry is correctly appended to the END of the file and accurately documents the updated search space.

Changes reviewed:

  • nvidia-master.yaml: Drops plain-TP8 entries, adds DP8 (dp-attn: true, tp: 8) entries, expands DP4 concurrency ranges — config looks correct
  • dsv4_fp4_b300_vllm.sh: MAX_NUM_BATCHED_TOKENS changes from constant 2048 to ISL/DP-conditional formula (max(ISL, 2048) for DP, max(2*ISL, 2048) for TP) — arithmetic is correct
  • perf-changelog.yaml: Entry appended at end, references PR Update dsv4 b300 configs #1155, search-space description matches the YAML — correct

Still open from previous review: The 🟡 WARNING about header comments in both the YAML (line 2466) and script (line 4) still describing dp-attn=true as exclusively a DP4/4-GPU switch, when it's now also used for DP8. Not blocking, but worth updating while the files are being touched.

No new blocking issues found. LGTM pending the optional comment update.

Comment thread perf-changelog.yaml
Comment on lines +1822 to +1828

- config-keys:
    - dsv4-fp4-b300-vllm
  description:
    - "Update search space based on B300 pareto sweep results"
    - "ISL=1024: TP4 conc 4-128; DP4 (dp-attn) conc 256-4096; DP8 (dp-attn) conc 2048-8192"
    - "ISL=8192: TP4 conc 4-64; DP4 (dp-attn) conc 128-1024; DP8 (dp-attn) conc 1024-8192"
Contributor


🟡 The new perf-changelog entry at perf-changelog.yaml:1822-1830 documents only the search-space reshape, but the diff also rewrites --max-num-batched-tokens in benchmarks/single_node/dsv4_fp4_b300_vllm.sh from a constant 2048 to an ISL/DP-conditional formula (DP: max(ISL, 2048); TP: max(2*ISL, 2048)). PR #1144's prior entry explicitly recorded "max-num-batched-tokens 2048" as part of the launch-args summary, so omitting the rewrite leaves that prior statement stale and the new formula invisible from the changelog audit trail. Suggest appending a description line such as: "Set --max-num-batched-tokens = max(ISL, 2048) for DP and max(2*ISL, 2048) for TP, replacing the previous constant 2048".

Extended reasoning...

What the bug is

This PR makes two substantive changes to dsv4-fp4-b300-vllm, but the new perf-changelog.yaml entry (lines 1822-1830) only describes one of them:

  1. Documented: search-space reshape (TP4/DP4/DP8 concurrency ranges for ISL=1024 and ISL=8192).
  2. Undocumented: in benchmarks/single_node/dsv4_fp4_b300_vllm.sh:41-46, 81, --max-num-batched-tokens is changed from a constant 2048 to a function of ISL and DP_ATTENTION:
    • DP mode: max(ISL, 2048)
    • TP mode: max(2*ISL, 2048)

Why this matters for the audit trail

The predecessor entry (PR #1144, the introducing PR for this config) explicitly records "max-num-batched-tokens 2048" as part of the launch-arg summary. After this PR that statement is stale, but no follow-up entry replaces it. AGENTS.md describes perf-changelog.yaml as the chronological record of changes that affect benchmarks; the max-num-batched-tokens rewrite materially affects request-batching behavior, so it belongs in the description.

Step-by-step proof of impact

For an ISL=8192 run in TP mode under the new formula:

  1. DP_ATTENTION != "true", so the TP branch is taken.
  2. MAX_NUM_BATCHED_TOKENS = 2 * 8192 = 16384 (the floor at 2048 is not binding).
  3. vllm serve --max-num-batched-tokens 16384 — 8x the previous constant 2048.

For ISL=8192 in DP mode ({ tp: 4|8, dp-attn: true }):

  1. DP branch is taken.
  2. MAX_NUM_BATCHED_TOKENS = 8192 — 4x the previous 2048.

These are not cosmetic deltas: they reshape the prefill-batching capacity and will alter throughput / TTFT characteristics relative to anything benchmarked under PR #1144.
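The two branches above can be sketched in shell. This is a hypothetical reconstruction of the formula the review describes, not the actual dsv4_fp4_b300_vllm.sh code; the variable names ISL and DP_ATTENTION are assumed from the review text.

```shell
# Hypothetical sketch of the new --max-num-batched-tokens formula.
ISL=8192
DP_ATTENTION=false   # TP mode for this walkthrough

if [ "$DP_ATTENTION" = "true" ]; then
  # DP mode: max(ISL, 2048)
  MAX_NUM_BATCHED_TOKENS=$(( ISL > 2048 ? ISL : 2048 ))
else
  # TP mode: max(2*ISL, 2048)
  MAX_NUM_BATCHED_TOKENS=$(( 2 * ISL > 2048 ? 2 * ISL : 2048 ))
fi

echo "$MAX_NUM_BATCHED_TOKENS"
```

For ISL=8192 in TP mode this prints 16384, matching step 2 of the proof above; flipping DP_ATTENTION to true yields 8192, and any ISL at or below 2048 hits the floor in either branch.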

Addressing the refutations

One refuter argued this is subjective wording and that perf-changelog descriptions vary in detail (some are one-liners). That is true in general, but here the new entry is specifically a successor to PR #1144's entry for the same config-keys: [dsv4-fp4-b300-vllm], which deliberately enumerated the launch args including max-num-batched-tokens 2048. Following that precedent, the value's replacement deserves a line. A second refuter called this a duplicate of bug_003 — the synthesis agent has merged them; this single comment now stands for both.

Severity and fix

This is a documentation-completeness issue with no runtime impact, so it is filed at nit severity (matching all three independent confirmations). Fix: append one description line to the new entry, e.g.

Set --max-num-batched-tokens = max(ISL, 2048) for DP and max(2*ISL, 2048) for TP, replacing the previous constant 2048.

Comment thread .github/configs/nvidia-master.yaml Outdated
Comment on lines +2482 to +2488
  - { tp: 8, dp-attn: true, conc-start: 2048, conc-end: 8192 }
- isl: 8192
  osl: 1024
  search-space:
    - { tp: 8, conc-start: 4, conc-end: 4 }
    - { tp: 4, conc-start: 4, conc-end: 128 }
    - { tp: 8, conc-start: 128, conc-end: 128 }
    - { tp: 4, dp-attn: true, conc-start: 256, conc-end: 512 }
    - { tp: 4, conc-start: 4, conc-end: 64 }
    - { tp: 4, dp-attn: true, conc-start: 128, conc-end: 1024 }
    - { tp: 8, dp-attn: true, conc-start: 1024, conc-end: 8192 }
Contributor


🟡 The two new TP=8/dp-attn:true rows added to dsv4-fp4-b300-vllm (lines 2482 and 2488) omit the ep field, so generate_sweep_configs.py defaults the metadata to ep=1. But dsv4_fp4_b300_vllm.sh unconditionally passes --enable-expert-parallel and sets --data-parallel-size 8 for these rows, so the actual run is EP=8 — the result-filename template (tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}) and downstream group-by tooling will tag these B300 rows as ep=1 while the underlying run is EP=8. Sister config dsv4-fp8-h200-vllm at line 2458/2462 explicitly tags the analogous TP=8/dp-attn:true row as { tp: 8, ep: 8, dp-attn: true, ... }. Suggest adding explicit ep: 8 to both new TP=8 entries to match the convention. (Note: the existing TP=4 dp-attn:true rows on this same config also omit ep, but that pattern was inherited from PR #1144 — this PR extends the issue to TP=8.)

Extended reasoning...

What the bug is

The two new search-space rows added to dsv4-fp4-b300-vllm omit the ep field:

# .github/configs/nvidia-master.yaml line 2482
- { tp: 8, dp-attn: true, conc-start: 2048, conc-end: 8192 }
# line 2488
- { tp: 8, dp-attn: true, conc-start: 1024, conc-end: 8192 }

In utils/matrix_logic/generate_sweep_configs.py:354, the matrix-entry default for ep is 1 (set unconditionally), and lines 362-363 only override ep if the YAML key was present (bmk.get(Fields.EP.value) returns None when omitted). So these matrix entries are tagged with ep=1 in the generated metadata.
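A minimal Python sketch of that defaulting behavior (the function and field names here are illustrative, not the actual generate_sweep_configs.py code):

```python
# Illustrative sketch: 'ep' defaults to 1 and is only overridden when the
# YAML row actually carries the key, mirroring the review's description.
def build_matrix_entry(bmk: dict) -> dict:
    entry = {"tp": bmk.get("tp"), "ep": 1}  # default ep=1, set unconditionally
    if bmk.get("ep") is not None:           # override only if the YAML key exists
        entry["ep"] = bmk["ep"]
    return entry

# A row that omits 'ep' is tagged ep=1 even though the script runs EP=8:
print(build_matrix_entry({"tp": 8, "dp-attn": True}))
# The h200 sister row carries an explicit ep: 8, so it is tagged correctly:
print(build_matrix_entry({"tp": 8, "ep": 8, "dp-attn": True}))
```

The key point is that the default is applied silently; nothing cross-checks the metadata against what the launch script actually does.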

Why the actual runtime is EP=8

benchmarks/single_node/dsv4_fp4_b300_vllm.sh does not consult the metadata ep value at all. At line 38 the parallel block sets --data-parallel-size "$TP" (i.e. --data-parallel-size 8 when TP=8 and DP_ATTENTION=true), and at line 78 it unconditionally passes --enable-expert-parallel. Under vLLM, --enable-expert-parallel with --data-parallel-size 8 runs with effective expert-parallel world size 8 (each rank holds 1/8 of the experts). So the runtime is EP=8 while the metadata says EP=1.

Why this is a metadata mismatch worth flagging

The sister config dsv4-fp8-h200-vllm at lines 2458 and 2462 explicitly tags the analogous TP=8/dp-attn:true rows as { tp: 8, ep: 8, dp-attn: true, ... } — confirming that ep: 8 is the established convention for this scenario across the dsv4 family. The metadata flows into RESULT_FILENAME (via .github/workflows/benchmark-tmpl.yml:146, template tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}), so the new B300 rows will be saved as tp8-ep1-dpaTrue while the actual run is EP=8. Downstream group-by tooling (compare_results.py, summarize.py, collect_eval_results.py) keys on ep, so cross-config analysis across the dsv4 family will misclassify these B300 rows.

Step-by-step proof

  1. Harness picks the new YAML row { tp: 8, dp-attn: true, conc-start: 2048, conc-end: 8192 } (line 2482).
  2. generate_sweep_configs.py:354 assigns ep=1 (default). Line 362 checks bmk.get('ep') which is None, so the default is kept. Metadata: tp=8, ep=1, dp-attn=true.
  3. Workflow exports EP_SIZE=1, TP=8, DP_ATTENTION=true. Result-filename template at benchmark-tmpl.yml:146 resolves to tp8-ep1-dpaTrue-....
  4. The script dsv4_fp4_b300_vllm.sh runs with --tensor-parallel-size 1 --data-parallel-size 8 --enable-expert-parallel. vLLM's effective EP world size is 8.
  5. The result file is tagged ep1 while the run is EP=8. The h200 sister config tags the same scenario ep=8.
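The filename resolution in step 3 can be illustrated directly (the template string is quoted from the review; the variable values are the defaulted metadata, and this is a sketch rather than the workflow's own templating code):

```python
# Sketch: resolving the result-filename template under the defaulted
# metadata (ep=1), even though the actual run is EP=8.
TP, EP_SIZE, DP_ATTENTION = 8, 1, "True"
name = f"tp{TP}-ep{EP_SIZE}-dpa{DP_ATTENTION}"
print(name)  # tp8-ep1-dpaTrue
```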

Impact and fix

Metadata-only — runtime behavior is correct because the script hardcodes parallelism. Severity is nit. Fix: add ep: 8 to both new TP=8 entries (lines 2482 and 2488) to match dsv4-fp8-h200-vllm. Ideally ep: 4 would also be added to the kept TP=4/dp-attn:true rows for full consistency, but that pattern was inherited from PR #1144 and is outside the scope of this PR.
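The suggested fix is a two-line YAML change, sketched here to match the h200 sister config's convention cited above:

```yaml
- { tp: 8, ep: 8, dp-attn: true, conc-start: 2048, conc-end: 8192 }
- { tp: 8, ep: 8, dp-attn: true, conc-start: 1024, conc-end: 8192 }
```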

@wzhao18 wzhao18 changed the title [WIP] Update dsv4 b300 configs Update dsv4 b300 configs Apr 25, 2026
@wzhao18
Collaborator Author

wzhao18 commented Apr 25, 2026

@kedarpotdar-nv PR ready for review - previous run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24935460573?pr=1155

One job failed due to a timeout. I have removed it. The rest looks fine.

@wzhao18 wzhao18 changed the title Update dsv4 b300 configs [WIP] Update dsv4 b300 configs Apr 26, 2026
wzhao18 added 3 commits April 26, 2026 00:26
Removed several benchmark configurations and updated the search space for dsv4-fp4-b300-vllm based on recent results.
@wzhao18 wzhao18 changed the title [WIP] Update dsv4 b300 configs Update dsv4 b300 configs Apr 26, 2026
@wzhao18
Collaborator Author

wzhao18 commented Apr 26, 2026

@functionstackx Could you review this PR? Thank you. All changes are ready except for resolving the perf-changelog.yaml conflicts.

Passing sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24948219425

Collaborator

@Oseltamivir Oseltamivir left a comment


lgtm

@Oseltamivir Oseltamivir merged commit 0e211bc into main Apr 26, 2026
13 of 45 checks passed
@Oseltamivir Oseltamivir deleted the nv/dsv4-b300-agg branch April 26, 2026 17:57
yhyang201 added a commit that referenced this pull request Apr 26, 2026
Resolve perf-changelog.yaml conflict: keep main's PR #1155 entry,
append our PR #1180 entry after it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>