Add DSv4 B200 configs #1156
Conversation
Thanks for the contribution! For vLLM and SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Claude finished @wzhao18's task in 1m 46s. Review: Add DSv4 B200 configs

LGTM - no blocking issues found. Config follows existing
```bash
set -x
vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
    "${PARALLEL_ARGS[@]}" \
    --pipeline-parallel-size 1 \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --block-size 256 \
    --no-enable-prefix-caching \
    "${EP_ARGS[@]}" \
    --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
    --attention_config.use_fp4_indexer_cache True \
    --tokenizer-mode deepseek_v4 \
    --tool-call-parser deepseek_v4 \
    --enable-auto-tool-choice \
    --reasoning-parser deepseek_v4 \
    --max-cudagraph-capture-size 2048 \
    --max-model-len "$SERVE_MAX_MODEL_LEN" \
    --max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS" > "$SERVER_LOG" 2>&1 &
```
🔴 The new vllm serve invocation does not pass --max-num-seqs, while nvidia-master.yaml schedules this config with conc-end: 4096 (1k1k DP-attn) and conc-end: 1024 (8k1k DP-attn). vLLM's per-replica default is 256, so even with DP=8 the engine caps in-flight requests at 8×256 = 2048 < 4096; the high-concurrency points will silently queue at the engine and report throughput/latency reflecting the cap rather than the requested concurrency. Suggest adding --max-num-seqs $CONC (or a high static value like 4096) to match the convention used in the sibling B200 vLLM recipes (gptoss_fp4_b200.sh, kimik2.5_fp4_b200.sh, kimik2.5_int4_b200.sh, dsv4_fp8_h200.sh).
Extended reasoning...
The bug: benchmarks/single_node/dsv4_fp4_b200_vllm.sh:69-85 builds the vllm serve command without passing --max-num-seqs. vLLM's default max_num_seqs is 256 per data-parallel replica. The matrix entry added to .github/configs/nvidia-master.yaml (dsv4-fp4-b200-vllm) schedules:
- ISL=1024, DP-attn TP=8 (DP=8): conc-end: 4096
- ISL=8192, DP-attn TP=8 (DP=8): conc-end: 1024
With DP=8 and per-replica default 256, the engine accepts at most 8×256 = 2048 concurrent sequences server-wide. So the CONC=4096 sweep point in the 1k1k DP-attn branch cannot actually be served at the requested concurrency — half the requests sit in the client-side or engine waiting queue while only ~2048 are processed in-flight.
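As a quick sanity check of that cap (plain shell, with the values from the paragraph above):

```bash
# Effective server-wide in-flight cap with vLLM's default max_num_seqs.
DP=8; DEFAULT_MAX_NUM_SEQS=256
echo $(( DP * DEFAULT_MAX_NUM_SEQS ))   # 2048, well below the requested 4096
```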
Why this matters for the sweep: This is a benchmark recipe whose entire point is to populate a Pareto curve. At CONC=4096 (and likely the second-highest point too) the reported throughput and latency reflect the server cap, not the requested in-flight count, polluting the curve. The output looks plausible (no crash, no error), so the issue is silent — exactly the kind of regression the verifiers flagged as "normal" rather than "nit."
An internal contradiction in the script confirms intent: line 83 sets --max-cudagraph-capture-size 2048, indicating the author expects to capture CUDA graphs for batch sizes up to 2048. But with default --max-num-seqs 256, only batch sizes up to 256 are ever realized per replica, so the larger captured graphs are never exercised. This implies the author meant to lift the seq cap and just forgot.
Sibling recipes consistently set this: every other vLLM script in benchmarks/single_node/ that sweeps comparable concurrencies sets --max-num-seqs explicitly — gptoss_fp4_b200.sh:61 uses 512, kimik2.5_fp4_b200.sh:41 and kimik2.5_int4_b200.sh:41 use $CONC, dsv4_fp8_h200.sh:56 uses 512. The b300 sister script dsv4_fp4_b300_vllm.sh shares the omission, but its conc-end caps at 512 with TP=8/DP≤4 so the default 256-per-replica × DP≥2 is enough; the b200 sweep is the first one to extend past the implicit cap.
Step-by-step proof:
- CI launches dsv4-fp4-b200-vllm for the 1k1k DP-attn branch with CONC=4096, TP=8, DP_ATTENTION=true.
- PARALLEL_ARGS in dsv4_fp4_b200_vllm.sh:34-37 sets --tensor-parallel-size 1 --data-parallel-size 8.
- vLLM creates 8 replicas, each with the default max_num_seqs = 256. Total simultaneous in-flight cap: 8×256 = 2048.
- run_benchmark_serving (line 95) launches with --max-concurrency 4096 --num-prompts 40960. The client opens ~4096 in-flight requests and feeds them to vLLM.
- vLLM accepts ~2048 sequences and queues the rest. Throughput plateaus at the 2048-cap saturation, but the benchmark records this as the CONC=4096 data point.
- The Pareto plot then shows two adjacent points (e.g., CONC=2048 and CONC=4096) with effectively identical server-side behavior, distorting the high end of the curve.
Fix: Add --max-num-seqs "$CONC" to the vllm serve invocation (or a static cap ≥ 4096). Using $CONC follows the pattern in kimik2.5_fp4_b200.sh / kimik2.5_int4_b200.sh and ensures the engine never becomes the bottleneck for the configured sweep point. Alternatively, cap conc-end in the matrix at a value that fits the per-replica default × DP, but that loses sweep coverage and is the less attractive option given the matrix already specifies 4096.
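For concreteness, a minimal sketch of the suggested change (only the relevant flags are shown; CONC and the other variables come from the script excerpt above, and the flag ordering is illustrative):

```bash
# Sketch: cap in-flight sequences at the sweep's requested concurrency so the
# engine's default per-replica limit (256) never becomes the bottleneck.
vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
    "${PARALLEL_ARGS[@]}" \
    --max-num-seqs "$CONC" \
    --max-model-len "$SERVE_MAX_MODEL_LEN" \
    --max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS" > "$SERVER_LOG" 2>&1 &
```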
Set MAX_NUM_BATCHED_TOKENS to a fixed value of 2048.
Removed and re-added DeepSeek-V4-Pro benchmark details in the changelog.
Adjust MAX_NUM_BATCHED_TOKENS based on ISL and concurrency.
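The PR description below summarizes this logic as "ISL=8192 low-conc uses ISL, otherwise 2048". A plausible shell sketch of that rule, assuming CONC/ISL variables and a hypothetical low-concurrency threshold of 64:

```bash
# Sketch (assumed variable names and threshold): choose the batched-token
# budget from input sequence length and sweep concurrency.
if [[ "$ISL" -eq 8192 && "$CONC" -le 64 ]]; then
    MAX_NUM_BATCHED_TOKENS="$ISL"   # long prompts at low concurrency: fit a full prompt
else
    MAX_NUM_BATCHED_TOKENS=2048     # default chunked-prefill budget
fi
```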
```bash
# Prefer a framework-tagged script (e.g. dsv4_fp4_b200_vllm.sh) so models
# with multiple inference engines can coexist; fall back to the historical
# name without an engine suffix (`_trt` for trt, bare for everyone else).
BENCH_BASE="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_b200"
```
Double check this is back-compatible
Yes, I think this is back-compatible: it first checks whether dsv4_fp4_b200_vllm.sh exists; if not, it falls back to checking whether a file with the old naming scheme exists.
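A minimal sketch of that dispatch-with-fallback, extending the excerpt above (the FRAMEWORK variable name and the error handling are assumptions, not the PR's verbatim code):

```bash
# Sketch (assumed FRAMEWORK variable): try the engine-tagged script first,
# then fall back to the historical un-suffixed name.
BENCH_BASE="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_b200"
if [[ -f "${BENCH_BASE}_${FRAMEWORK}.sh" ]]; then
    BENCH_SCRIPT="${BENCH_BASE}_${FRAMEWORK}.sh"
elif [[ -f "${BENCH_BASE}.sh" ]]; then
    BENCH_SCRIPT="${BENCH_BASE}.sh"
else
    echo "No benchmark script found for ${MODEL_CODE}/${PRECISION}" >&2
    exit 1
fi
```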
Does it even need to be backwards compatible? @Oseltamivir
Updated performance changelog with new benchmarks and configurations for DeepSeek-V4-Pro and vLLM. Removed obsolete entries and added links to relevant pull requests.
Ready for review/merge - sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24958988442

@wzhao18 can't tell if this is a flake? https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24958988442/job/73094516252
@cquil11 Thanks for flagging this. It seems memory is tight in that case, resulting in some flakiness in the runs. The same config passed at concurrency 128 but failed at concurrency 64. I will wait for the new attempt to run. If it fails again, I will update the configs to reduce memory usage.
Removed GPU memory utilization settings for DP mode.
Add GPU memory utilization argument for DP attention.
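Taken together, these two commits suggest the flag is now applied only in DP-attention mode. A hedged sketch (the DP_ATTENTION gate and the 0.85 value are assumptions for illustration):

```bash
# Sketch (assumed gate variable and value): lower GPU memory utilization only
# when DP attention is enabled; otherwise keep vLLM's default.
MEM_ARGS=()
if [[ "$DP_ATTENTION" == "true" ]]; then
    MEM_ARGS+=(--gpu-memory-utilization 0.85)
fi
# "${MEM_ARGS[@]}" is then spliced into the vllm serve invocation shown earlier.
```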
The sweep is passing: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24977657678?pr=1156 PR is ready for review/merge.

@claude please add a PR description
Claude finished @kedarpotdar-nv's task in 1m 19s. Adding PR Description

The PR description has been added with a summary of all changes: the new vLLM benchmark script, the master config entry, the B200 launcher dispatch-with-fallback updates, and the perf-changelog entry.
Summary

Adds a DeepSeek-V4-Pro FP4 single-node vLLM benchmark recipe for B200, alongside the existing SGLang recipe (dsv4-fp4-b200-sglang).

Changes

- New benchmark script (benchmarks/single_node/dsv4_fp4_b200_vllm.sh):
  - With dp-attn=false: TP=8, no expert parallel
  - With dp-attn=true: DP=8, expert parallel enabled via EP_SIZE
  - Uses the vllm/vllm-openai:deepseekv4-cu130 image
  - MAX_NUM_BATCHED_TOKENS logic (ISL=8192 at low concurrency uses ISL, otherwise 2048)
- Master config (.github/configs/nvidia-master.yaml): adds the dsv4-fp4-b200-vllm config entry with sweep ranges
- B200 launcher updates (runners/launch_b200-cw.sh, launch_b200-nb.sh, launch_b200-dgxc.sh; following launch_b300-nv.sh from PR "[NV] Add deepseek-v4-pro b300 vllm config" #1144): prefers the framework-tagged script name (e.g. dsv4_fp4_b200_vllm.sh) and falls back to legacy bare/_trt naming for backwards compatibility; renames launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh
- Perf changelog (perf-changelog.yaml): adds an entry for dsv4-fp4-b200-vllm

Sweep

Passing sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24977657678