Add B300 config: dsv4-fp4-sglang-mtp #1151

Closed
cquil11 wants to merge 26 commits into main from chore/dsv4-sgl-b300-mtp

Conversation


@cquil11 cquil11 commented Apr 25, 2026

Summary

MTP variant of #1132's `dsv4-fp4-b300-sglang`. Mirrors the same recipe-per-CONC structure with EAGLE / MTP enabled where the cookbook prescribes it, per https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4.

| Recipe | CONC | EAGLE flags |
| --- | --- | --- |
| low-latency | ≤ 32 | `--speculative-num-steps 3 --speculative-num-draft-tokens 4` |
| balanced | 33–128 | `--speculative-num-steps 1 --speculative-num-draft-tokens 2` |
| max-throughput | > 128 | none — the cookbook says the verify step costs more than it saves at saturation |

`SGLANG_ENABLE_SPEC_V2=1` is set per the cookbook's MTP requirement; `--use-chat-template` is passed to `bench_serving` so the EAGLE acceptance rate isn't depressed by raw prompts (per AGENTS.md).
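The per-recipe dispatch above can be sketched as a small bash helper. This is a hypothetical illustration of the CONC thresholds and flag sets from the table — the function name and shape are illustrative, not the PR's actual `RECIPE_FLAGS` code:

```bash
#!/usr/bin/env bash
# Illustrative sketch of the recipe-per-CONC EAGLE flag selection described
# above. Thresholds and flags come from the PR description; the helper
# itself is hypothetical.
select_eagle_flags() {
  local conc="$1"
  if (( conc <= 32 )); then
    # low-latency recipe
    echo "--speculative-num-steps 3 --speculative-num-draft-tokens 4"
  elif (( conc <= 128 )); then
    # balanced recipe
    echo "--speculative-num-steps 1 --speculative-num-draft-tokens 2"
  else
    # max-throughput: MTP off, per the cookbook
    echo ""
  fi
}
```

The empty string for CONC > 128 matches the cookbook's guidance that the verify step is a net loss at saturation.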

Files

  • `.github/configs/nvidia-master.yaml` — new `dsv4-fp4-b300-sglang-mtp` config; same TP/EP/dp-attn split as the non-MTP version, all rows tagged `spec-decoding: mtp`.
  • `benchmarks/single_node/dsv4_fp4_b300_mtp.sh` — adds EAGLE flags to low-latency and balanced `RECIPE_FLAGS` arrays (max-throughput unchanged).
  • `perf-changelog.yaml` — additive entry to trigger the sweep.

Test plan

  • `generate_sweep_configs.py --runner-type b300 --model-prefix dsv4` → 17 `spec=mtp` matrix rows for the new config (4+2+3 per 1k1k, 4+2+2 per 8k1k).
  • `pytest utils/matrix_logic/` → 149 passed.
  • Sweep run completes; result filenames carry `spec-mtp` and EAGLE acceptance is logged on low-latency / balanced.

Notes

Depends on the same lmsysorg/sglang:deepseek-v4-b300 image as #1132. The DeepEP FP8 weight-postprocess bug that #1132 hit on dpa=true rows is currently being tracked separately; this PR will hit the same failure on balanced + max-throughput until that's resolved upstream.

cquil11 and others added 25 commits April 24, 2026 01:14
Adds dsv4-fp4-b300-sglang config, single-node benchmark script, and
perf-changelog entry for the DeepSeek-V4 recipe from the SGLang
cookbook. The cookbook ships a B200 (not B300) recipe, so this
reuses the B200 Flash Low-Latency recipe on B300 until a
B300-specific recipe lands. Speculative decoding (EAGLE) and prefix
caching are disabled per request.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match parallelism (TP=8/EP=8/dp-attn=true) and concurrency ranges
(4-1024 for 1k1k, 4-512 for 8k1k) to dsv4-fp4-b200-vllm. Use the
DeepSeek-V4-Pro variant with the cookbook Max-Throughput recipe
(DP=8 + DeepEP, no MTP), which aligns with the requested no-spec
parallelism. Prefix caching remains disabled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…4-fp4-b200-vllm

Port the HF cache mount rework from the DSV4 B200 VLLM branch so
both PRs stay consistent: use the shared /scratch/fsw/gharunners/hf-hub-cache
path, drop the local MODEL override, and mount onto \$HF_HUB_CACHE
inside the container.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dsv4-fp4-b300-sglang entry was appended correctly, but the earlier
edit also stripped trailing spaces on an existing line, producing a
spurious deletion. Revert so the diff is additive-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror the lockfile logic already in launch_b200-dgxc-slurm.sh and
launch_h200-dgxc-slurm.sh: serialize concurrent enroot imports of
the same squash file via flock, skip the import when the squash is
already valid, and override ENROOT_CACHE_PATH to avoid permission
issues with the system-wide cache on worker nodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The override ("avoid permission issues with system-wide cache on
worker nodes") is a dgxc-slurm-specific workaround; launch_b300-nv.sh
is on the NV slurm cluster, not dgxc-slurm. Copying it in caused
the benchmark srun's pyxis shadow hook to fail with
'mkdir: cannot create directory pyxis_$JOBID.1/data: File exists'.
Keep the flock + skip-if-valid logic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the squash cache from /data/squash to /data/home/sa-shared/gharunners/squash,
and the HF cache mount from /scratch/models to /data/home/sa-shared/gharunners/hf-hub-cache.
Also mount the host HF cache onto \$HF_HUB_CACHE inside the container so
tools reading the default HF path pick it up (matches the B200 dgxc-slurm
runner). Drop the /scratch/models Qwen3.5 path override since that path
is no longer used.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Running two srun steps in the same allocation (flock+import, then the
benchmark --container-image srun) reproducibly fails on this cluster
with:
  error: pyxis: mkdir: cannot create directory
    '/scratch/data/user-$UID/pyxis_$JOBID.1/data': File exists
  error: pyxis:     [ERROR] /etc/enroot/hooks.d/10-shadow.sh exited with return code 1

Per NVIDIA/pyxis#138, two srun steps sharing an allocation can leave
enroot/pyxis state between steps. Collapsing to a single srun (the
benchmark) is the cleanest workaround. Move the flock-guarded
enroot import to the host side, before salloc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Even with a single srun step, pyxis fails with
  error: pyxis: mkdir: cannot create directory
      '/scratch/data/user-$UID/pyxis_$JOBID.0/data': File exists
on fresh SLURM JOB_IDs. The /scratch path is left behind by previous
jobs whose IDs SLURM later reuses (and the cluster's pyxis epilog
doesn't clean it up). Wipe pyxis_$JOBID.* from the host after salloc;
no-op if /scratch is node-local, effective if it's shared NFS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1128 (dsv4-fp4fp8-b300-vllm) runs on the same cluster with ZERO
changes to launch_b300-nv.sh. The pyxis 10-shadow.sh failures we were
chasing aren't caused by the runner -- reset it to origin/main and
keep the sglang config/bench additions only.

Reverts (from this branch):
- 4bb1f1a point B300 runner at shared gharunners/{squash,hf-hub-cache}
- 106deea drop ENROOT_CACHE_PATH override
- 97a488e add flock-guarded squash import
- 744c5a0 move enroot import out of srun
- d003c59 wipe stale pyxis scratch before benchmark srun

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move enroot import out of srun to the head node and serialize parallel
GH jobs with flock on the shared squash file. Skips the import when a
valid squash already exists. The benchmark srun is now the only step
in the allocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
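The flock-guarded, skip-if-valid import described in this commit can be sketched as follows. This is a minimal illustration under stated assumptions — the function name, the "valid" check (non-empty file), and the injectable import command are all hypothetical stand-ins for the runner's actual enroot invocation:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a flock-serialized squash import: concurrent GH
# jobs contend on a lock file next to the shared squash, and the import is
# skipped when a valid (here: non-empty) squash already exists.
import_squash() {
  local squash="$1" import_cmd="$2"
  (
    flock -x 9                      # serialize against parallel jobs
    if [[ ! -s "$squash" ]]; then   # skip when already imported
      "$import_cmd" "$squash"
    fi
  ) 9>"${squash}.lock"
}
```

In the real runner the import command would be something like `enroot import -o "$squash" docker://…`; a mock command is enough to exercise the skip logic.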
Port the B200 branch's fix for the lmsysorg/sglang:deepseek-v4-blackwell
image on B300:
- The image installs sglang editable under /workspace/sglang; the default
  $GITHUB_WORKSPACE:/workspace/ bind-mount masks the install and breaks
  'import sglang'. For this image, mount at /ix instead.
- The image's ENV bakes CUDA_VISIBLE_DEVICES=4,5,6,7, masking half the
  GPUs Slurm allocates. Unset it in the bench script so TP=8 sees all 8.
- Write artefacts under $PWD instead of hard-coded /workspace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-staged models on the B300 cluster live under /data/models
(Qwen3.5-397B-A17B-FP8, dsv4-pro, etc.). Switch HF_HUB_CACHE_MOUNT
from /scratch/models to /data/models, and export MODEL to
/data/models/dsv4-pro when MODEL_PREFIX=dsv4 so the benchmark reads
from the mounted dir directly. The bench script skips `hf download`
when MODEL looks like an absolute path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
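The "skip `hf download` when MODEL looks like an absolute path" behavior mentioned in this commit amounts to a leading-slash check. A minimal sketch, with an illustrative function name (the actual bench-script code is not shown in this PR page):

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the absolute-path check: a hub repo id like
# deepseek-ai/DeepSeek-V4-Pro needs downloading; a pre-staged path like
# /data/models/dsv4-pro is read directly from the mount.
needs_download() {
  [[ "$1" != /* ]]
}

MODEL="/data/models/dsv4-pro"
if needs_download "$MODEL"; then
  echo "would run: hf download $MODEL"   # only reached for hub repo ids
else
  echo "MODEL is a pre-staged local path; skipping hf download"
fi
```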
The stock lmsysorg/sglang:deepseek-v4-blackwell image ships kernels
compiled for B200 (SM_100) and crashes on B300 with
  RuntimeError: RMSNorm failed with error code no kernel image is
  available for execution on the device
during CUDA graph capture. Switch to cquil/sglang-deepseek-v4-bw-ultra:v1,
which is recompiled with B300 SM support.

Broaden the /ix mount conditional to match both image tags: the fork
keeps the same /workspace/sglang editable install that would otherwise
be masked by $GITHUB_WORKSPACE:/workspace/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Use the B300-recompiled image from yhyang201; extend the /ix mount
conditional to match the new tag in addition to the previous
deepseek-v4-blackwell / deepseek-v4-bw-ultra patterns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror chore/dsv4-sgl-b200 commits 103a202 + 43be495 for B300:

Bench script now selects one of three cookbook recipes by CONC instead
of a single static flag set:
  CONC <= 32   -> low-latency    (TP only, chunked-prefill 4096,
                                  disable-flashinfer-autotune)
  33..128      -> balanced       (+ DP-attention, max-running-reqs=128,
                                  cuda-graph-max-bs=64, deepep-config)
  CONC > 128   -> max-throughput (+ DP-attention, max-running-reqs=256,
                                  cuda-graph-max-bs=64, deepep-config)
No speculative decoding in any recipe; --disable-radix-cache kept for
the no-prefix-caching baseline.

Split the dsv4-fp4-b300-sglang search-space rows per recipe boundary so
result filenames (ep=, dpa=) accurately reflect which recipe ran.
ep=8 on balanced/max-throughput reflects sglang's implicit
ep_size=tp_size override when --moe-a2a-backend deepep is set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch B300 dsv4 sglang image to lmsysorg/sglang:deepseek-v4-b300
and extend the /ix mount conditional to match the new tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The DeepEP FP8 weight-postprocess path is broken for
deepseek-ai/DeepSeek-V4-Pro on B300 with
lmsysorg/sglang:deepseek-v4-b300 -- every sglang launch with
--moe-a2a-backend deepep fails during model load with
  RuntimeError: Recipe must be a list/tuple of 3 integers.
raised from sglang.srt.layers.quantization.fp8
.process_weights_after_loading_block_quant (fp8.py:957). The balanced
and max-throughput recipes both go through that path; the low-latency
recipe (TP-only, flashinfer_mxfp4 MoE) does not and loads cleanly.

Collapse the yaml search-space back to a single row spanning the full
CONC range (4..1024 for 1k1k, 4..512 for 8k1k) and hardcode the bench
script to the low-latency flags at every CONC. TODO(Cam) noted in both
files to restore the recipe-per-CONC dispatch once the DeepEP FP8 load
path is fixed upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors dsv4-fp4-b300-sglang but with EAGLE / MTP enabled per the
cookbook recipes at https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4:
  low-latency    -> EAGLE 3 steps / 4 draft tokens
  balanced       -> EAGLE 1 step  / 2 draft tokens
  max-throughput -> MTP off (verify step costs more than it saves
                    at saturation, per the cookbook)

Sets SGLANG_ENABLE_SPEC_V2=1 (required for MTP) and passes
--use-chat-template to bench_serving so EAGLE acceptance rate
isn't depressed by raw prompts.

Search-space rows tagged spec-decoding=mtp; same TP/EP/dp-attn and
concurrency ranges as the non-MTP config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment

…e comment

- nvidia-master.yaml: rewrite stale dsv4-fp4-b300-sglang header that
  still claimed "Max-Throughput recipe (DP=8 + DeepEP, no MTP)" even
  though the rows now span all three recipes; drop the max-throughput
  search-space rows from dsv4-fp4-b300-sglang-mtp since the cookbook
  says MTP is disabled at saturation, so labelling those rows
  spec-decoding=mtp would be misleading.
- dsv4_fp4_b300_mtp.sh: collapse the three-branch recipe dispatch to
  two (low-latency, balanced) since CONC > 128 is no longer in the
  search-space.
- perf-changelog.yaml: clarify that max-throughput is intentionally
  omitted, not "MTP off but tagged mtp".

Matrix now expands to 12 mtp rows for dsv4-fp4-b300-sglang-mtp
(4 low-latency + 2 balanced per seq-len).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +1 to +4
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"


🔴 The new benchmarks/single_node/dsv4_fp4_b300.sh is dead code for the dsv4-fp4-b300-sglang config: benchmarks/single_node/dsv4_fp4_b300_sglang.sh (added by PR #1146) still exists with the hardcoded low-latency recipe, and runners/launch_b300-nv.sh:267-272 picks the framework-tagged file first and only falls back to the bare name when missing. The newly added balanced and max-throughput rows (tp=8 ep=8 dp-attn=true, conc 64-1024 / 64-512) will execute the stale low-latency recipe but produce result filenames tagged ep=8/dpa=true, mislabelling the data. Fix: rename dsv4_fp4_b300.sh → dsv4_fp4_b300_sglang.sh (overwriting), or delete the stale dsv4_fp4_b300_sglang.sh. The MTP variant is unaffected because dsv4_fp4_b300_sglang_mtp.sh does not exist.

Extended reasoning...

What goes wrong

This PR adds two new benchmark scripts:

  • benchmarks/single_node/dsv4_fp4_b300.sh — recipe-per-CONC dispatch (low-latency / balanced / max-throughput)
  • benchmarks/single_node/dsv4_fp4_b300_mtp.sh — MTP variant of the same dispatch

It also expands the dsv4-fp4-b300-sglang YAML from a single TP-only (tp:8, ep:1, conc 4-1024/4-512) row into three recipes per seq-len, with the inline comment "are selected inside benchmarks/single_node/dsv4_fp4_b300.sh by CONC".

The problem: benchmarks/single_node/dsv4_fp4_b300_sglang.sh already exists on main. It was created by PR #1146 (rename of the old dsv4_fp4_b300.sh to add the framework suffix), and was later edited to hardcode the low-latency recipe at every CONC (with a TODO(Cam) comment explicitly pointing at this branch chore/dsv4-sgl-b300 as the place to restore CONC dispatch). The branch's revert commit 90e8f3d recreated dsv4_fp4_b300.sh (the old, pre-rename path) but did not touch the framework-suffixed file — so after this PR merges, both files coexist on disk with different contents.

Why the new script never runs

runners/launch_b300-nv.sh:267-272:

BENCH_BASE="benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_b300"
BENCH_SCRIPT="${BENCH_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh"
if [[ ! -f "$BENCH_SCRIPT" ]]; then
    LEGACY_FW_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
    BENCH_SCRIPT="${BENCH_BASE}${LEGACY_FW_SUFFIX}${SPEC_SUFFIX}.sh"
fi

The framework-tagged path is preferred; the bare-name fallback only fires when the tagged file is missing.

Step-by-step proof for one row

Take the new balanced row { tp: 8, ep: 8, dp-attn: true, conc-start: 64, conc-end: 128 } from the dsv4-fp4-b300-sglang 1k1k search-space:

  1. The runner is invoked with EXP_NAME=dsv4_1k1k (model-prefix dsv4, seq tag 1k1k), PRECISION=fp4, FRAMEWORK=sglang, no spec → SPEC_SUFFIX="".
  2. BENCH_BASE = "benchmarks/single_node/dsv4_fp4_b300".
  3. BENCH_SCRIPT = "benchmarks/single_node/dsv4_fp4_b300_sglang.sh" — and that file exists (verified on disk: 3008 bytes, blob c9fb238).
  4. The fallback to dsv4_fp4_b300.sh is not taken.
  5. The hardcoded RECIPE=low-latency block runs with --tp 8 (no --dp-size, no --enable-dp-attention, no --moe-a2a-backend deepep), regardless of CONC.
  6. But the result filename is composed from the YAML row's ep=8/dpa=true tags, so the output is mislabelled: low-latency numbers presented as balanced numbers.

The same logic applies to the max-throughput rows (CONC 256-1024 / 256-512). Effectively the entire YAML expansion in this PR is a no-op for the non-MTP config — the low-latency-only recipe runs at all concurrency points, and three out of four search-space rows produce misleading data.

Why MTP is unaffected

For the new dsv4-fp4-b300-sglang-mtp config, SPEC_SUFFIX=_mtp, so the runner first checks for dsv4_fp4_b300_sglang_mtp.sh — which does not exist — then falls back to dsv4_fp4_b300_mtp.sh, which is the new MTP script added in this PR. So MTP works correctly; only the non-MTP config is broken.
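The two resolution paths above can be reproduced against a mock directory tree. This sketch re-implements the quoted launch_b300-nv.sh lookup as a standalone function — the file names mirror the review comment, but the helper and the temp-dir setup are illustrative:

```bash
#!/usr/bin/env bash
# Sketch of the framework-tagged-first, bare-name-fallback resolution
# quoted above, exercised against a mock benchmarks/ tree.
set -euo pipefail
tmp=$(mktemp -d)
mkdir -p "$tmp/benchmarks/single_node"
touch "$tmp/benchmarks/single_node/dsv4_fp4_b300.sh"         # new bare-name script (this PR)
touch "$tmp/benchmarks/single_node/dsv4_fp4_b300_sglang.sh"  # stale low-latency script (PR #1146)
touch "$tmp/benchmarks/single_node/dsv4_fp4_b300_mtp.sh"     # new MTP script (this PR)

resolve() {
  local framework="$1" spec_suffix="$2"
  local base="$tmp/benchmarks/single_node/dsv4_fp4_b300"
  local script="${base}_${framework}${spec_suffix}.sh"
  if [[ ! -f "$script" ]]; then
    local legacy
    legacy=$([[ "$framework" == "trt" ]] && printf '_trt' || printf '')
    script="${base}${legacy}${spec_suffix}.sh"
  fi
  basename "$script"
}

resolve sglang ""      # the stale framework-tagged file wins
resolve sglang "_mtp"  # falls back, since _sglang_mtp.sh is absent
```

The first call returns the stale dsv4_fp4_b300_sglang.sh (the mislabelling case); the second falls through to dsv4_fp4_b300_mtp.sh (why MTP is unaffected).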

Fix

Either:

  • Rename benchmarks/single_node/dsv4_fp4_b300.sh → benchmarks/single_node/dsv4_fp4_b300_sglang.sh (overwriting the stale file), or
  • Delete the existing benchmarks/single_node/dsv4_fp4_b300_sglang.sh so the new bare-name file is found via fallback.

The first option is the more direct fix since it leaves the framework-tagged convention intact (matching what PR #1146 standardised).

@cquil11 cquil11 force-pushed the chore/dsv4-sgl-b300-mtp branch 2 times, most recently from b5854d3 to eb35ba1 on April 25, 2026 at 06:38

cquil11 commented Apr 26, 2026

closing in favor of #1166
