Add DSv4 B200 configs #1156

Merged
Ankur-singh merged 25 commits into main from nv/dsv4-b200-agg on Apr 27, 2026

Conversation

wzhao18 (Collaborator) commented Apr 25, 2026

Summary

Adds a DeepSeek-V4-Pro FP4 single-node vLLM benchmark recipe for B200, alongside the existing SGLang recipe (dsv4-fp4-b200-sglang).

Changes

New benchmark script: benchmarks/single_node/dsv4_fp4_b200_vllm.sh

  • Derived from the B200 Pareto sweep
  • TP mode (dp-attn=false): TP=8, no expert parallel
  • DP mode (dp-attn=true): DP=8, expert parallel enabled via EP_SIZE
  • Uses vllm/vllm-openai:deepseekv4-cu130 image
  • FP8 KV cache, block size 256, prefix caching disabled
  • Conditional MAX_NUM_BATCHED_TOKENS logic (ISL=8192 at low concurrency uses the ISL, otherwise 2048); see the sketch after this list
  • GPU memory utilization capped at 0.85 for DP-attn mode
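
A minimal bash sketch of the mode selection and token-budget logic above. The variable names (DP_ATTENTION, EP_SIZE, ISL, CONC) and the low-concurrency cutoff are illustrative assumptions, not necessarily the script's exact identifiers; note that a later commit in this PR pins MAX_NUM_BATCHED_TOKENS to a fixed 2048.

if [ "$DP_ATTENTION" = "true" ]; then
    # DP mode: data-parallel attention across all 8 GPUs.
    # (This branch initially also capped --gpu-memory-utilization at 0.85;
    # that setting was removed later in this PR.)
    PARALLEL_ARGS=(--tensor-parallel-size 1 --data-parallel-size 8)
else
    # TP mode: plain tensor parallelism, no expert parallel.
    PARALLEL_ARGS=(--tensor-parallel-size 8)
fi

# Expert parallelism is switched on only when EP_SIZE is set.
EP_ARGS=()
if [ -n "${EP_SIZE:-}" ]; then
    EP_ARGS=(--enable-expert-parallel)
fi

# ISL=8192 at low concurrency uses the ISL as the batched-token budget;
# all other sweep points use 2048. The cutoff value here is a placeholder.
if [ "$ISL" -eq 8192 ] && [ "$CONC" -le 4 ]; then
    MAX_NUM_BATCHED_TOKENS="$ISL"
else
    MAX_NUM_BATCHED_TOKENS=2048
fi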

Master config: .github/configs/nvidia-master.yaml

  • New dsv4-fp4-b200-vllm config entry with sweep ranges:
    • ISL=1024/OSL=1024: TP8 conc 1–64 | TP8+EP8 conc 128 | DP-attn conc 256–4096
    • ISL=8192/OSL=1024: TP8 conc 1–32 | DP-attn conc 64–1024

B200 launcher updates: runners/launch_b200-cw.sh, launch_b200-nb.sh, launch_b200-dgxc.sh

  • Added dispatch-with-fallback pattern (matching launch_b300-nv.sh from PR #1144, "[NV] Add deepseek-v4-pro b300 vllm config"): prefers the framework-tagged script name (e.g. dsv4_fp4_b200_vllm.sh), falls back to the legacy bare/_trt naming for backwards compatibility; see the sketch below
  • Renamed launch_b200-dgxc-slurm.sh to launch_b200-dgxc.sh
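
A sketch of that fallback under assumed variable names (MODEL_CODE, PRECISION, FRAMEWORK are hypothetical here); only the BENCH_BASE line is quoted verbatim from the launcher snippet discussed later in this thread.

BENCH_BASE="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_b200"
if [ -f "${BENCH_BASE}_${FRAMEWORK}.sh" ]; then
    # Preferred: framework-tagged name, e.g. dsv4_fp4_b200_vllm.sh
    BENCH_SCRIPT="${BENCH_BASE}_${FRAMEWORK}.sh"
elif [ "$FRAMEWORK" = "trt" ]; then
    # Legacy: trt scripts carried a _trt suffix
    BENCH_SCRIPT="${BENCH_BASE}_trt.sh"
else
    # Legacy: every other engine used the bare name
    BENCH_SCRIPT="${BENCH_BASE}.sh"
fi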

Perf changelogperf-changelog.yaml

  • Entry added for dsv4-fp4-b200-vllm

Sweep

Passing sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24977657678

github-actions (Contributor) commented:

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we merge your PR into the master branch. Let's ensure the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that, after merging, all GitHub Actions jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

wzhao18 changed the title from "Add DSv4 B200 configs" to "[WIP] Add DSv4 B200 configs" on Apr 25, 2026
wzhao18 marked this pull request as ready for review on April 25, 2026 16:37
wzhao18 requested a review from a team on April 25, 2026 16:37
claude bot (Contributor) commented Apr 25, 2026

Claude finished @wzhao18's task in 1m 46s


Review: Add DSv4 B200 configs

  • Gather context and read changed files
  • Validate master config (model prefixes, image accessibility)
  • Validate perf-changelog.yaml (presence and position)
  • Review benchmark script
  • Post review findings

LGTM - no blocking issues found.

Config follows existing dsv4-fp4-b200-sglang patterns, image is publicly accessible on Docker Hub, perf-changelog entry is correctly appended at the end, benchmark script properly conditionalizes --enable-expert-parallel on EP_SIZE, and server command is well-formatted with line continuations.

Comment on lines +68 to +86
set -x
vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
"${PARALLEL_ARGS[@]}" \
--pipeline-parallel-size 1 \
--kv-cache-dtype fp8 \
--trust-remote-code \
--block-size 256 \
--no-enable-prefix-caching \
"${EP_ARGS[@]}" \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache True \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--max-cudagraph-capture-size 2048 \
--max-model-len "$SERVE_MAX_MODEL_LEN" \
--max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS" > "$SERVER_LOG" 2>&1 &

Contributor:

🔴 The new vllm serve invocation does not pass --max-num-seqs, while nvidia-master.yaml schedules this config with conc-end: 4096 (1k1k DP-attn) and conc-end: 1024 (8k1k DP-attn). vLLM's per-replica default is 256, so even with DP=8 the engine caps in-flight requests at 8×256 = 2048 < 4096; the high-concurrency points will silently queue at the engine and report throughput/latency reflecting the cap rather than the requested concurrency. Suggest adding --max-num-seqs $CONC (or a high static value like 4096) to match the convention used in the sibling B200 vLLM recipes (gptoss_fp4_b200.sh, kimik2.5_fp4_b200.sh, kimik2.5_int4_b200.sh, dsv4_fp8_h200.sh).

Extended reasoning:

The bug: benchmarks/single_node/dsv4_fp4_b200_vllm.sh:69-85 builds the vllm serve command without passing --max-num-seqs. vLLM's default max_num_seqs is 256 per data-parallel replica. The matrix entry added to .github/configs/nvidia-master.yaml (dsv4-fp4-b200-vllm) schedules:

  • ISL=1024, DP-attn TP=8 (DP=8): conc-end: 4096
  • ISL=8192, DP-attn TP=8 (DP=8): conc-end: 1024

With DP=8 and per-replica default 256, the engine accepts at most 8×256 = 2048 concurrent sequences server-wide. So the CONC=4096 sweep point in the 1k1k DP-attn branch cannot actually be served at the requested concurrency — half the requests sit in the client-side or engine waiting queue while only ~2048 are processed in-flight.

Why this matters for the sweep: This is a benchmark recipe whose entire point is to populate a Pareto curve. At CONC=4096 (and likely the second-highest point too) the reported throughput and latency reflect the server cap, not the requested in-flight count, polluting the curve. The output looks plausible (no crash, no error), so the issue is silent — exactly the kind of regression the verifiers flagged as "normal" rather than "nit."

An internal contradiction in the script confirms intent: line 83 sets --max-cudagraph-capture-size 2048, indicating the author expects to capture CUDA graphs for batch sizes up to 2048. But with default --max-num-seqs 256, only batch sizes up to 256 are ever realized per replica, so the larger captured graphs are never exercised. This implies the author meant to lift the seq cap and just forgot.

Sibling recipes consistently set this: every other vLLM script in benchmarks/single_node/ that sweeps comparable concurrencies sets --max-num-seqs explicitly — gptoss_fp4_b200.sh:61 uses 512, kimik2.5_fp4_b200.sh:41 and kimik2.5_int4_b200.sh:41 use $CONC, dsv4_fp8_h200.sh:56 uses 512. The b300 sister script dsv4_fp4_b300_vllm.sh shares the omission, but its conc-end caps at 512 with TP=8/DP≤4 so the default 256-per-replica × DP≥2 is enough; the b200 sweep is the first one to extend past the implicit cap.

Step-by-step proof:

  1. CI launches dsv4-fp4-b200-vllm for the 1k1k DP-attn branch with CONC=4096, TP=8, DP_ATTENTION=true.
  2. PARALLEL_ARGS in dsv4_fp4_b200_vllm.sh:34-37 sets --tensor-parallel-size 1 --data-parallel-size 8. vLLM creates 8 replicas, each with the default max_num_seqs = 256. Total simultaneous in-flight cap: 8×256 = 2048.
  3. run_benchmark_serving (line 95) launches with --max-concurrency 4096 --num-prompts 40960. The client opens ~4096 in-flight requests and feeds them to vLLM.
  4. vLLM accepts ~2048 sequences, queues the rest. Throughput plateaus at the 2048-cap saturation, but the benchmark records this as the CONC=4096 data point.
  5. The Pareto plot then shows two adjacent points (e.g., CONC=2048 and CONC=4096) with effectively identical server-side behavior, distorting the high end of the curve.

Fix: Add --max-num-seqs "$CONC" to the vllm serve invocation (or a static cap ≥ 4096). Using $CONC follows the pattern in kimik2.5_fp4_b200.sh / kimik2.5_int4_b200.sh and ensures the engine never becomes the bottleneck for the configured sweep point. Alternatively, cap conc-end in the matrix at a value that fits the per-replica default × DP, but that loses sweep coverage and is the less attractive option given the matrix already specifies 4096.
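
A sketch of the suggested change, abridged to the relevant flags (everything else stays as in the excerpt reviewed above):

# With DP=8 replicas and vLLM's default max_num_seqs of 256 per replica, the
# server-wide in-flight cap is 8 * 256 = 2048 < 4096. Passing the sweep's
# concurrency through removes the engine-side bottleneck:
vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
    "${PARALLEL_ARGS[@]}" \
    --max-num-seqs "$CONC" \
    --max-model-len "$SERVE_MAX_MODEL_LEN" \
    --max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS" > "$SERVER_LOG" 2>&1 &
# (remaining flags unchanged from the original invocation)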

Comment thread: benchmarks/single_node/dsv4_fp4_b200_vllm.sh
wzhao18 added 3 commits April 25, 2026 20:12
Set MAX_NUM_BATCHED_TOKENS to a fixed value of 2048.
Removed and re-added DeepSeek-V4-Pro benchmark details in the changelog.
wzhao18 requested a review from Qiaolin-Yu as a code owner on April 26, 2026 14:26
Comment thread: runners/launch_b200-cw.sh
# Prefer a framework-tagged script (e.g. dsv4_fp4_b200_vllm.sh) so models
# with multiple inference engines can coexist; fall back to the historical
# name without an engine suffix (`_trt` for trt, bare for everyone else).
BENCH_BASE="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_b200"
Collaborator:

Double check this is back-compatible

Collaborator Author (wzhao18):

Yes, I think this is back-compatible: it first checks whether dsv4_fp4_b200_vllm.sh exists; if not, it falls back to checking whether a file with the old naming scheme exists.

Collaborator:

Does it even need to be backwards compatible? @Oseltamivir

wzhao18 changed the title from "[WIP] Add DSv4 B200 configs" to "Add DSv4 B200 configs" on Apr 26, 2026
wzhao18 added 3 commits April 26, 2026 16:29
Updated performance changelog with new benchmarks and configurations for DeepSeek-V4-Pro and vLLM. Removed obsolete entries and added links to relevant pull requests.

cquil11 (Collaborator) commented Apr 26, 2026

@wzhao18 can't tell if this is a flake: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24958988442/job/73094516252
It has failed twice; please look into this.

wzhao18 (Collaborator, Author) commented Apr 26, 2026

@cquil11 Thanks for flagging this. It seems memory is tight in that case, resulting in some flakiness in the runs. The same config passed at concurrency 128 but failed at concurrency 64. I will wait for the new attempt to run; if it fails again, I will update the configs to reduce memory usage.

wzhao18 added 2 commits April 26, 2026 19:46
Removed GPU memory utilization settings for DP mode.
wzhao18 changed the title from "Add DSv4 B200 configs" to "[WIP] Add DSv4 B200 configs" on Apr 26, 2026
wzhao18 (Collaborator, Author) commented Apr 27, 2026

The sweep is passing: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24977657678?pr=1156

PR ready for review / merge.

wzhao18 changed the title from "[WIP] Add DSv4 B200 configs" to "Add DSv4 B200 configs" on Apr 27, 2026
kedarpotdar-nv (Collaborator) commented:

@claude please add a PR description

Klaud-Cold (Collaborator) commented Apr 27, 2026

Claude finished @kedarpotdar-nv's task in 1m 19s


Adding PR Description

  • Gather context from changed files
  • Write and post PR description

PR description has been added with a summary of all changes: new vLLM benchmark script, master config entry, B200 launcher dispatch-with-fallback updates, and perf-changelog entry.

jgangani (Collaborator) left a comment:

LGTM

Ankur-singh merged commit e6554f8 into main on Apr 27, 2026
8 of 35 checks passed
Ankur-singh deleted the nv/dsv4-b200-agg branch on April 27, 2026 17:58