Add DSv4 B200 configs #1156
Conversation
Thanks for the contribution! For vLLM and SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Claude finished @wzhao18's task in 1m 46s. Review: Add DSv4 B200 configs

LGTM - no blocking issues found. Config follows existing
```bash
set -x
vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
    "${PARALLEL_ARGS[@]}" \
    --pipeline-parallel-size 1 \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --block-size 256 \
    --no-enable-prefix-caching \
    "${EP_ARGS[@]}" \
    --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
    --attention_config.use_fp4_indexer_cache True \
    --tokenizer-mode deepseek_v4 \
    --tool-call-parser deepseek_v4 \
    --enable-auto-tool-choice \
    --reasoning-parser deepseek_v4 \
    --max-cudagraph-capture-size 2048 \
    --max-model-len "$SERVE_MAX_MODEL_LEN" \
    --max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS" > "$SERVER_LOG" 2>&1 &
```
🔴 The new vllm serve invocation does not pass --max-num-seqs, while nvidia-master.yaml schedules this config with conc-end: 4096 (1k1k DP-attn) and conc-end: 1024 (8k1k DP-attn). vLLM's per-replica default is 256, so even with DP=8 the engine caps in-flight requests at 8×256 = 2048 < 4096; the high-concurrency points will silently queue at the engine and report throughput/latency reflecting the cap rather than the requested concurrency. Suggest adding --max-num-seqs $CONC (or a high static value like 4096) to match the convention used in the sibling B200 vLLM recipes (gptoss_fp4_b200.sh, kimik2.5_fp4_b200.sh, kimik2.5_int4_b200.sh, dsv4_fp8_h200.sh).
Extended reasoning...
The bug: benchmarks/single_node/dsv4_fp4_b200_vllm.sh:69-85 builds the vllm serve command without passing --max-num-seqs. vLLM's default max_num_seqs is 256 per data-parallel replica. The matrix entry added to .github/configs/nvidia-master.yaml (dsv4-fp4-b200-vllm) schedules:
- ISL=1024, DP-attn TP=8 (DP=8): conc-end: 4096
- ISL=8192, DP-attn TP=8 (DP=8): conc-end: 1024
With DP=8 and per-replica default 256, the engine accepts at most 8×256 = 2048 concurrent sequences server-wide. So the CONC=4096 sweep point in the 1k1k DP-attn branch cannot actually be served at the requested concurrency — half the requests sit in the client-side or engine waiting queue while only ~2048 are processed in-flight.
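As a quick sanity check of that cap (plain shell, with the values from the paragraph above):

```bash
# Effective server-wide in-flight cap with vLLM's default max_num_seqs.
DP=8; DEFAULT_MAX_NUM_SEQS=256
echo $(( DP * DEFAULT_MAX_NUM_SEQS ))   # 2048, well below the requested 4096
```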
Why this matters for the sweep: This is a benchmark recipe whose entire point is to populate a Pareto curve. At CONC=4096 (and likely the second-highest point too) the reported throughput and latency reflect the server cap, not the requested in-flight count, polluting the curve. The output looks plausible (no crash, no error), so the issue is silent — exactly the kind of regression the verifiers flagged as "normal" rather than "nit."
An internal contradiction in the script confirms intent: line 83 sets --max-cudagraph-capture-size 2048, indicating the author expects to capture CUDA graphs for batch sizes up to 2048. But with default --max-num-seqs 256, only batch sizes up to 256 are ever realized per replica, so the larger captured graphs are never exercised. This implies the author meant to lift the seq cap and just forgot.
Sibling recipes consistently set this: every other vLLM script in benchmarks/single_node/ that sweeps comparable concurrencies sets --max-num-seqs explicitly — gptoss_fp4_b200.sh:61 uses 512, kimik2.5_fp4_b200.sh:41 and kimik2.5_int4_b200.sh:41 use $CONC, dsv4_fp8_h200.sh:56 uses 512. The b300 sister script dsv4_fp4_b300_vllm.sh shares the omission, but its conc-end caps at 512 with TP=8/DP≤4 so the default 256-per-replica × DP≥2 is enough; the b200 sweep is the first one to extend past the implicit cap.
Step-by-step proof:
- CI launches dsv4-fp4-b200-vllm for the 1k1k DP-attn branch with CONC=4096, TP=8, DP_ATTENTION=true.
- PARALLEL_ARGS in dsv4_fp4_b200_vllm.sh:34-37 sets --tensor-parallel-size 1 --data-parallel-size 8.
- vLLM creates 8 replicas, each with the default max_num_seqs = 256. Total simultaneous in-flight cap: 8×256 = 2048.
- run_benchmark_serving (line 95) launches with --max-concurrency 4096 --num-prompts 40960. The client opens ~4096 in-flight requests and feeds them to vLLM.
- vLLM accepts ~2048 sequences and queues the rest. Throughput plateaus at the 2048-cap saturation, but the benchmark records this as the CONC=4096 data point.
- The Pareto plot then shows two adjacent points (e.g., CONC=2048 and CONC=4096) with effectively identical server-side behavior, distorting the high end of the curve.
Fix: Add --max-num-seqs "$CONC" to the vllm serve invocation (or a static cap ≥ 4096). Using $CONC follows the pattern in kimik2.5_fp4_b200.sh / kimik2.5_int4_b200.sh and ensures the engine never becomes the bottleneck for the configured sweep point. Alternatively, cap conc-end in the matrix at a value that fits the per-replica default × DP, but that loses sweep coverage and is the less attractive option given the matrix already specifies 4096.
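For concreteness, a minimal sketch of the suggested change (only the relevant flags are shown; CONC and the other variables come from the script excerpt above, and the flag ordering is illustrative):

```bash
# Sketch: cap in-flight sequences at the sweep's requested concurrency so the
# engine's default per-replica limit (256) never becomes the bottleneck.
vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
    "${PARALLEL_ARGS[@]}" \
    --max-num-seqs "$CONC" \
    --max-model-len "$SERVE_MAX_MODEL_LEN" \
    --max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS" > "$SERVER_LOG" 2>&1 &
```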
Set MAX_NUM_BATCHED_TOKENS to a fixed value of 2048.
Removed and re-added DeepSeek-V4-Pro benchmark details in the changelog.
Adjust MAX_NUM_BATCHED_TOKENS based on ISL and concurrency.
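The PR description below summarizes this logic as "ISL=8192 low-conc uses ISL, otherwise 2048". A plausible shell sketch of that rule, assuming CONC/ISL variables and a hypothetical low-concurrency threshold of 64:

```bash
# Sketch (assumed variable names and threshold): choose the batched-token
# budget from input sequence length and sweep concurrency.
if [[ "$ISL" -eq 8192 && "$CONC" -le 64 ]]; then
    MAX_NUM_BATCHED_TOKENS="$ISL"   # long prompts at low concurrency: fit a full prompt
else
    MAX_NUM_BATCHED_TOKENS=2048     # default chunked-prefill budget
fi
```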
```bash
# Prefer a framework-tagged script (e.g. dsv4_fp4_b200_vllm.sh) so models
# with multiple inference engines can coexist; fall back to the historical
# name without an engine suffix (`_trt` for trt, bare for everyone else).
BENCH_BASE="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_b200"
```
Double check this is back-compatible
Yes, I think this is back-compatible: it first checks whether dsv4_fp4_b200_vllm.sh exists; if not, it falls back to checking whether a file with the old naming scheme exists.
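A minimal sketch of that dispatch-with-fallback, extending the excerpt above (the FRAMEWORK variable name and the error handling are assumptions, not the PR's verbatim code):

```bash
# Sketch (assumed FRAMEWORK variable): try the engine-tagged script first,
# then fall back to the historical un-suffixed name.
BENCH_BASE="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_b200"
if [[ -f "${BENCH_BASE}_${FRAMEWORK}.sh" ]]; then
    BENCH_SCRIPT="${BENCH_BASE}_${FRAMEWORK}.sh"
elif [[ -f "${BENCH_BASE}.sh" ]]; then
    BENCH_SCRIPT="${BENCH_BASE}.sh"
else
    echo "No benchmark script found for ${MODEL_CODE}/${PRECISION}" >&2
    exit 1
fi
```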
Does it even need to be backwards compatible? @Oseltamivir
Updated performance changelog with new benchmarks and configurations for DeepSeek-V4-Pro and vLLM. Removed obsolete entries and added links to relevant pull requests.
Ready for review/merge - sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24958988442

@wzhao18 can't tell if this is a flake? https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24958988442/job/73094516252
@cquil11 Thanks for flagging this. It seems memory is tight in that case, resulting in some flakiness in the runs. The same config passed at concurrency 128 but failed at concurrency 64. I will wait for the new attempt to run. If it fails again, I will update the configs to reduce memory usage.
Removed GPU memory utilization settings for DP mode.
Add GPU memory utilization argument for DP attention.
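Taken together, these two commits suggest the flag is now applied only in DP-attention mode. A hedged sketch (the DP_ATTENTION gate and the 0.85 value are assumptions for illustration):

```bash
# Sketch (assumed gate variable and value): lower GPU memory utilization only
# when DP attention is enabled; otherwise keep vLLM's default.
MEM_ARGS=()
if [[ "$DP_ATTENTION" == "true" ]]; then
    MEM_ARGS+=(--gpu-memory-utilization 0.85)
fi
# "${MEM_ARGS[@]}" is then spliced into the vllm serve invocation shown earlier.
```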
The sweep is passing: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24977657678?pr=1156 PR is ready for review/merge.

@claude please add a PR description
Claude finished @kedarpotdar-nv's task in 1m 19s. Adding PR Description

The PR description has been added with a summary of all changes: the new vLLM benchmark script, the master config entry, the B200 launcher dispatch-with-fallback updates, and the perf-changelog entry.
Summary

Adds a DeepSeek-V4-Pro FP4 single-node vLLM benchmark recipe for B200, alongside the existing SGLang recipe (dsv4-fp4-b200-sglang).

Changes

- New benchmark script (benchmarks/single_node/dsv4_fp4_b200_vllm.sh):
  - With dp-attn=false: TP=8, no expert parallel
  - With dp-attn=true: DP=8, expert parallel enabled via EP_SIZE
  - Uses the vllm/vllm-openai:deepseekv4-cu130 image
  - MAX_NUM_BATCHED_TOKENS logic (ISL=8192 at low concurrency uses ISL, otherwise 2048)
- Master config (.github/configs/nvidia-master.yaml): adds the dsv4-fp4-b200-vllm config entry with sweep ranges
- B200 launcher updates (runners/launch_b200-cw.sh, launch_b200-nb.sh, launch_b200-dgxc.sh; following launch_b300-nv.sh from PR "[NV] Add deepseek-v4-pro b300 vllm config" #1144): prefers the framework-tagged script name (e.g. dsv4_fp4_b200_vllm.sh) and falls back to legacy bare/_trt naming for backwards compatibility; renames launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh
- Perf changelog (perf-changelog.yaml): adds an entry for dsv4-fp4-b200-vllm

Sweep

Passing sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24977657678