Update dsv4 b300 configs #1155
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Claude finished @wzhao18's task in 1m 29s

Re-review of PR #1155. Previous 🔴 BLOCKING issue (missing …): changes reviewed.

Still open from the previous review: the 🟡 WARNING about header comments in both the YAML (line 2466) and the script (line 4) still describing … No new blocking issues found. LGTM pending the optional comment update.
```yaml
- config-keys:
    - dsv4-fp4-b300-vllm
  description:
    - "Update search space based on B300 pareto sweep results"
    - "ISL=1024: TP4 conc 4-128; DP4 (dp-attn) conc 256-4096; DP8 (dp-attn) conc 2048-8192"
    - "ISL=8192: TP4 conc 4-64; DP4 (dp-attn) conc 128-1024; DP8 (dp-attn) conc 1024-8192"
```
🟡 The new perf-changelog entry at perf-changelog.yaml:1822-1830 documents only the search-space reshape, but the diff also rewrites `--max-num-batched-tokens` in `benchmarks/single_node/dsv4_fp4_b300_vllm.sh` from a constant 2048 to an ISL/DP-conditional formula (DP: max(ISL, 2048); TP: max(2*ISL, 2048)). PR #1144's prior entry explicitly recorded "max-num-batched-tokens 2048" as part of the launch-args summary, so omitting the rewrite leaves that prior statement stale and the new formula invisible in the changelog audit trail. Suggest appending a description line such as: "Set --max-num-batched-tokens = max(ISL, 2048) for DP and max(2*ISL, 2048) for TP, replacing the previous constant 2048".
Extended reasoning...
What the bug is
This PR makes two substantive changes to dsv4-fp4-b300-vllm, but the new perf-changelog.yaml entry (lines 1822-1830) only describes one of them:
- Documented: search-space reshape (TP4/DP4/DP8 concurrency ranges for ISL=1024 and ISL=8192).
- Undocumented: in `benchmarks/single_node/dsv4_fp4_b300_vllm.sh:41-46, 81`, `--max-num-batched-tokens` is changed from a constant `2048` to a function of ISL and DP_ATTENTION (sketched below):
  - DP mode: `max(ISL, 2048)`
  - TP mode: `max(2*ISL, 2048)`
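As a minimal bash sketch of that conditional (the variable names `ISL` and `DP_ATTENTION` come from this comment; the exact layout in the script is an assumption, not the repo's verbatim code):

```bash
# Sketch of the reworked --max-num-batched-tokens selection; layout and
# variable names are assumed from the description above, not copied from
# the actual dsv4_fp4_b300_vllm.sh.
if [ "$DP_ATTENTION" = "true" ]; then
  # DP mode: max(ISL, 2048)
  MAX_NUM_BATCHED_TOKENS=$(( ISL > 2048 ? ISL : 2048 ))
else
  # TP mode: max(2*ISL, 2048)
  MAX_NUM_BATCHED_TOKENS=$(( 2 * ISL > 2048 ? 2 * ISL : 2048 ))
fi
```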
Why this matters for the audit trail
The predecessor entry (PR #1144, the PR that introduced this config) explicitly records "max-num-batched-tokens 2048" as part of the launch-arg summary. After this PR that statement is stale, but no follow-up entry replaces it. AGENTS.md describes perf-changelog.yaml as the chronological record of changes that affect benchmarks; the max-num-batched-tokens rewrite materially affects request-batching behavior, so it belongs in the description.
Step-by-step proof of impact
For an ISL=8192 run in TP mode under the new formula:
- `DP_ATTENTION != "true"`, so the TP branch is taken.
- `MAX_NUM_BATCHED_TOKENS = 2 * 8192 = 16384` (the floor at 2048 is not binding).
- `vllm serve --max-num-batched-tokens 16384`, 8x the previous constant 2048.

For ISL=8192 in DP mode (`{ tp: 4|8, dp-attn: true }`):
- The DP branch is taken.
- `MAX_NUM_BATCHED_TOKENS = 8192`, 4x the previous 2048.
These are not cosmetic deltas: they reshape the prefill-batching capacity and will alter throughput / TTFT characteristics relative to anything benchmarked under PR #1144.
Addressing the refutations
One refuter argued this is subjective wording and that perf-changelog descriptions vary in detail (some are one-liners). That is true in general, but here the new entry is specifically a successor to PR #1144's entry for the same config-keys: [dsv4-fp4-b300-vllm], which deliberately enumerated the launch args including max-num-batched-tokens 2048. Following that precedent, the value's replacement deserves a line. A second refuter called this a duplicate of bug_003 — the synthesis agent has merged them; this single comment now stands for both.
Severity and fix
This is a documentation-completeness issue with no runtime impact, so it is filed at nit severity (matching all three independent confirmations). Fix: append one description line to the new entry, e.g.
Set --max-num-batched-tokens = max(ISL, 2048) for DP and max(2*ISL, 2048) for TP, replacing the previous constant 2048.
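For illustration, the amended changelog entry could then read as follows (a sketch that reuses the entry structure from the diff above; only the last description line is new):

```yaml
- config-keys:
    - dsv4-fp4-b300-vllm
  description:
    - "Update search space based on B300 pareto sweep results"
    - "ISL=1024: TP4 conc 4-128; DP4 (dp-attn) conc 256-4096; DP8 (dp-attn) conc 2048-8192"
    - "ISL=8192: TP4 conc 4-64; DP4 (dp-attn) conc 128-1024; DP8 (dp-attn) conc 1024-8192"
    - "Set --max-num-batched-tokens = max(ISL, 2048) for DP and max(2*ISL, 2048) for TP, replacing the previous constant 2048"
```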
```yaml
      - { tp: 8, dp-attn: true, conc-start: 2048, conc-end: 8192 }
  - isl: 8192
    osl: 1024
    search-space:
      - { tp: 8, conc-start: 4, conc-end: 4 }
      - { tp: 4, conc-start: 4, conc-end: 128 }
      - { tp: 8, conc-start: 128, conc-end: 128 }
      - { tp: 4, dp-attn: true, conc-start: 256, conc-end: 512 }
      - { tp: 4, conc-start: 4, conc-end: 64 }
      - { tp: 4, dp-attn: true, conc-start: 128, conc-end: 1024 }
      - { tp: 8, dp-attn: true, conc-start: 1024, conc-end: 8192 }
```
🟡 The two new TP=8/dp-attn:true rows added to dsv4-fp4-b300-vllm (lines 2482 and 2488) omit the ep field, so generate_sweep_configs.py defaults the metadata to ep=1. But dsv4_fp4_b300_vllm.sh unconditionally passes --enable-expert-parallel and sets --data-parallel-size 8 for these rows, so the actual run is EP=8 — the result-filename template (tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}) and downstream group-by tooling will tag these B300 rows as ep=1 while the underlying run is EP=8. Sister config dsv4-fp8-h200-vllm at line 2458/2462 explicitly tags the analogous TP=8/dp-attn:true row as { tp: 8, ep: 8, dp-attn: true, ... }. Suggest adding explicit ep: 8 to both new TP=8 entries to match the convention. (Note: the existing TP=4 dp-attn:true rows on this same config also omit ep, but that pattern was inherited from PR #1144 — this PR extends the issue to TP=8.)
Extended reasoning...
What the bug is
The two new search-space rows added to dsv4-fp4-b300-vllm omit the ep field:
```yaml
# .github/configs/nvidia-master.yaml line 2482
- { tp: 8, dp-attn: true, conc-start: 2048, conc-end: 8192 }
# line 2488
- { tp: 8, dp-attn: true, conc-start: 1024, conc-end: 8192 }
```

In `utils/matrix_logic/generate_sweep_configs.py:354`, the matrix-entry default for `ep` is 1 (set unconditionally), and lines 362-363 only override `ep` if the YAML key was present (`bmk.get(Fields.EP.value)` returns `None` when omitted). So these matrix entries are tagged with `ep=1` in the generated metadata.
Why the actual runtime is EP=8
`benchmarks/single_node/dsv4_fp4_b300_vllm.sh` does not consult the metadata `ep` value at all. At line 38 the parallel block sets `--data-parallel-size "$TP"` (i.e. `--data-parallel-size 8` when TP=8 and `DP_ATTENTION=true`), and at line 78 it unconditionally passes `--enable-expert-parallel`. Under vLLM, `--enable-expert-parallel` with `--data-parallel-size 8` runs with an effective expert-parallel world size of 8 (each rank holds 1/8 of the experts). So the runtime is EP=8 while the metadata says EP=1.
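A hedged sketch of the flag wiring just described (branch structure, `$MODEL`, and the non-dp-attn branch are assumptions inferred from the quoted lines, not the script verbatim):

```bash
# Sketch of the parallelism flags described above; only the flags quoted
# in this comment are shown, everything else is omitted.
if [ "$DP_ATTENTION" = "true" ]; then
  PARALLEL_ARGS="--tensor-parallel-size 1 --data-parallel-size $TP"
else
  PARALLEL_ARGS="--tensor-parallel-size $TP"   # assumed TP-mode branch
fi
# --enable-expert-parallel is passed unconditionally, so with
# --data-parallel-size 8 the effective expert-parallel world size is 8.
vllm serve "$MODEL" $PARALLEL_ARGS --enable-expert-parallel
```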
Why this is a metadata mismatch worth flagging
The sister config dsv4-fp8-h200-vllm at lines 2458 and 2462 explicitly tags the analogous TP=8/dp-attn:true rows as { tp: 8, ep: 8, dp-attn: true, ... } — confirming that ep: 8 is the established convention for this scenario across the dsv4 family. The metadata flows into RESULT_FILENAME (via .github/workflows/benchmark-tmpl.yml:146, template tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}), so the new B300 rows will be saved as tp8-ep1-dpaTrue while the actual run is EP=8. Downstream group-by tooling (compare_results.py, summarize.py, collect_eval_results.py) keys on ep, so cross-config analysis across the dsv4 family will misclassify these B300 rows.
Step-by-step proof
1. The harness picks the new YAML row `{ tp: 8, dp-attn: true, conc-start: 2048, conc-end: 8192 }` (line 2482). `generate_sweep_configs.py:354` assigns `ep=1` (default). Line 362 checks `bmk.get('ep')`, which is `None`, so the default is kept. Metadata: `tp=8, ep=1, dp-attn=true`.
2. The workflow exports `EP_SIZE=1`, `TP=8`, `DP_ATTENTION=true`. The result-filename template at `benchmark-tmpl.yml:146` resolves to `tp8-ep1-dpaTrue-...`.
3. The script `dsv4_fp4_b300_vllm.sh` runs with `--tensor-parallel-size 1 --data-parallel-size 8 --enable-expert-parallel`. vLLM's effective EP world size is 8.
4. The result file is tagged `ep1` while the run is EP=8. The h200 sister config tags the same scenario `ep=8`.
Impact and fix
Metadata-only — runtime behavior is correct because the script hardcodes parallelism. Severity is nit. Fix: add ep: 8 to both new TP=8 entries (lines 2482 and 2488) to match dsv4-fp8-h200-vllm. Ideally ep: 4 would also be added to the kept TP=4/dp-attn:true rows for full consistency, but that pattern was inherited from PR #1144 and is outside the scope of this PR.
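Concretely, the two rows with the suggested fix applied would look like this (based on the rows quoted above):

```yaml
# .github/configs/nvidia-master.yaml, with explicit ep tags matching
# the dsv4-fp8-h200-vllm convention
- { tp: 8, ep: 8, dp-attn: true, conc-start: 2048, conc-end: 8192 }   # line 2482
- { tp: 8, ep: 8, dp-attn: true, conc-start: 1024, conc-end: 8192 }   # line 2488
```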
@kedarpotdar-nv PR ready for review. Previous run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24935460573?pr=1155. One job failed due to timeout; I have removed it. The rest looks fine.
Removed several benchmark configurations and updated the search space for dsv4-fp4-b300-vllm based on recent results.
@functionstackx Could you review this PR? Thank you. All changes are ready except needing to resolve … Passing sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24948219425