chore: upstream srt-slurm recipes + first-class recipe field + custom-bench wrapper #1211
Conversation
Recipes referenced from NVIDIA/srt-slurm@sa-submission-q2-2026 are now tracked under benchmarks/multi_node/srt-slurm-recipes/, mirroring the upstream `recipes/` layout. The master-yaml plumbing for selecting one is hoisted out of `prefill.additional-settings: ["CONFIG_FILE=recipes/..."]` into a first-class `recipe:` field on the search-space entry, validated against on-disk paths so unknown recipes fail fast at sweep generation. The benchmark template resolves it to an absolute scratch-copy path passed to launchers as CONFIG_FILE, so launcher behavior is unchanged otherwise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six launchers each carried a ~22-line copy of the same git-clone, uv-install, venv-create, srtctl-install sequence. Lift it into clone_and_install_srtctl() in benchmarks/benchmark_lib.sh, parameterized by SRT_REPO_URL/SRT_BRANCH and UV_INSTALL_DIR/UV_VENV_DIR env vars so each launcher can keep its workspace-vs-NFS-vs-default-HOME placement decisions explicit at the call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lift the `echo "$IMAGE" | sed 's/[/:@#]/_/g'` slug used to name squash files
out of 13 launchers and into sanitize_image_filename() in benchmark_lib.sh.
Cluster-specific separator (h100/h200-dgxc-slurm use '+' instead of '_') is
expressed as the second arg, and the nvcr.io/-prefix-strip variant becomes
`sanitize_image_filename "${IMAGE#nvcr.io/}" +` rather than a sed pipeline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
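The lifted slug helper can be sketched as below — a hedged illustration of the signature described in the commit message (image string plus optional separator defaulting to `_`), not the literal benchmark_lib.sh source:

```shell
# Illustrative sketch of sanitize_image_filename: replace every /, :, @, #
# in an image reference with a separator so it can name a squash file.
sanitize_image_filename() {
  local image="$1"
  local sep="${2:-_}"
  # '|' as the sed delimiter since '/' appears in the character class
  echo "$image" | sed "s|[/:@#]|${sep}|g"
}

sanitize_image_filename "nvcr.io/nvidia/tritonserver:24.01"
# → nvcr.io_nvidia_tritonserver_24.01

# The nvcr.io-prefix-strip variant from the commit message:
IMAGE="nvcr.io/nvidia/tritonserver:24.01"
sanitize_image_filename "${IMAGE#nvcr.io/}" +
# → nvidia+tritonserver+24.01
```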
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Restructure benchmarks/multi_node/srt-slurm-recipes/ from the upstream's heterogeneous layout into a uniform tree: <model>/<framework>/<hw>-<precision>/<isl>/<osl>/<agg|disagg>/<stp|mtp>/<recipe>.yaml so a sweep contributor can navigate to dsr1/trtllm/b200-fp4/1k/1k/disagg/mtp/ and immediately see every recipe that fits that cell. The 3 sglang multi-override files that span both stp and mtp are parked one level shallower (no trailing stp|mtp/), since their override section selects stp vs mtp. 365 files moved, 388 active + 5 commented recipe references rewritten, schema validation + tests still green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per request, drop the awkward `1k/1k/` two-segment intermediate in the recipe tree in favor of `1k1k/`. New shape: <model>/<framework>/<hw>-<precision>/<isl><osl>/<agg|disagg>/<stp|mtp>/<recipe>.yaml 370 files renamed, 393 recipe references in nvidia-master.yaml rewritten, schema validation + tests still green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the ishandhanani/srt-slurm@sa-submission-q1-2026 fallback in launch_gb200-nv.sh — every launcher now clones NVIDIA/srt-slurm at the pinned commit 52e697d (nginx fd-limit fix on origin/main, Apr 2026). Pinning to a SHA instead of a moving branch keeps benchmark runs reproducible across upstream churn. Rename the helper's SRT_BRANCH env var to SRT_REF for accuracy (it accepts any git ref, not just a branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop env-var override for SRT_REPO_URL / SRT_REF — every benchmark run must use the same pinned srtctl, no ad-hoc overrides at the call site. Bumping the pin is now a one-line edit to benchmark_lib.sh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…of-life)
Stop relying on srt-slurm's bundled `benchmark.type: sa-bench` (which ships
its own copy of bench_serving.py inside the upstream repo) and instead use
`benchmark.type: custom` to run *this* repo's utils/bench_serving against the
already-ready frontend. Avoids dual-maintaining the bench client.
Plumbing:
- benchmarks/multi_node/srt_bench.sh: thin wrapper that mirrors sa-bench's
per-conc warmup-then-bench loop, writes results to the same
/logs/sa-bench_isl_<ISL>_osl_<OSL>/results_concurrency_<N>_gpus_<TOT>_ctx_<P>_gen_<D>.json
layout the launcher result-harvesters already grep, with conc list parsed
from x-separated env (e.g. "128x256x1024").
- Recipe shape: add `container_mounts: { $INFMAX_WORKSPACE: /infmax-workspace }`
+ replace `benchmark: { type: sa-bench, ... }` with
`benchmark: { type: custom, command: "bash /infmax-workspace/...", env: {...} }`.
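The x-separated concurrency parsing described above can be sketched as follows — a hedged illustration where `parse_conc_list` and `CONCS` are assumed names, not the literal srt_bench.sh source:

```shell
# Split an x-separated concurrency list (e.g. "128x256x1024") into an array,
# then drive a per-concurrency warmup-then-bench loop over it.
parse_conc_list() {
  local list="$1"
  IFS='x' read -r -a CONCS <<< "$list"
}

parse_conc_list "128x256x1024"
for conc in "${CONCS[@]}"; do
  echo "warmup-then-bench at concurrency $conc"
done
```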
Migrated as proof-of-life:
- dsr1/trtllm/b200-fp4/1k1k mtp ctx1_gen2_dep8_batch64_eplb0_mtp2 (TRT-LLM)
- dsr1/sglang/gb200-fp4/1k1k stp low-latency (SGLang)
- dsv4/vllm/gb200-fp4/1k1k stp disagg-gb200-1p1d-dep8-tep8 (vLLM)
Remaining ~360 recipes still use sa-bench; they migrate in a follow-up once
this triplet runs end-to-end on a real cluster.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two pieces, one commit:
1. benchmark_lib.sh's run_benchmark_serving() gains optional pass-throughs
for --tokenizer / --endpoint / --dataset-name / --dataset-path so the
multi-node srt_bench.sh wrapper can reuse it instead of forking its own
command-build. (--request-rate stays hardcoded "inf" — no recipe-level
override.) ~50 lines of duplicated shell deleted from srt_bench.sh.
2. Recipe `benchmark.env` blocks lose every variable that is already
exported by .github/workflows/benchmark-multinode-tmpl.yml at the
workflow step level: MODEL, ISL, OSL, CONC_LIST, DISAGG, RANDOM_RANGE_RATIO.
Those propagate down through srtctl → srun (default --export=ALL) → pyxis
into the bench container, so srt_bench.sh reads them directly. Recipes
now only carry per-recipe topology knobs (PREFILL_GPUS / DECODE_GPUS /
TOTAL_GPUS — used in the result filename) plus the rare overrides.
Tokenizer is hardcoded to /model — srtctl's RuntimeContext.create
unconditionally bind-mounts the local model dir at that path in every
container, so AutoTokenizer.from_pretrained("/model") loads from the same
files the engine is serving. No HF Hub egress, works for HF-id and alias-
only `model:` values alike, no `TOKENIZER_PATH` knob in recipes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
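The argument assembly described above can be sketched as below. This is a hedged illustration: `MODEL`/`ISL`/`OSL` arrive from the workflow step via `srun --export=ALL`, the tokenizer is pinned to the `/model` bind-mount, and `build_bench_args` plus the exact flag names are assumptions mirroring common bench_serving interfaces, not the literal srt_bench.sh:

```shell
# Build per-concurrency bench arguments from workflow-exported env vars.
# MODEL/ISL/OSL are expected in the environment; /model is the bind-mounted
# local model dir, so no HF Hub egress is needed for the tokenizer.
build_bench_args() {
  local conc="$1"
  echo "--model ${MODEL} --tokenizer /model --random-input-len ${ISL} --random-output-len ${OSL} --max-concurrency ${conc}"
}

MODEL="deepseek-ai/DeepSeek-R1" ISL=1024 OSL=1024
build_bench_args 128
# → --model deepseek-ai/DeepSeek-R1 --tokenizer /model --random-input-len 1024 --random-output-len 1024 --max-concurrency 128
```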
…er/--endpoint Walked back the --dataset-name / --dataset-path additions to run_benchmark_serving — both default cleanly (random / unset) for every multi-node throughput sweep we run, so the pass-throughs were dead weight. srt_bench.sh stops setting DATASET_NAME / DATASET_PATH from env. Kept --tokenizer (srt_bench points it at /model since --model is the served-model-name alias, not a HF id) and --endpoint (recipes may need /v1/chat/completions for chat-template-only request paths). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same behavior, fewer lines: collapse the two-step suffix split into a
single ${RECIPE#"${RECIPE%%:*}"} parameter expansion. 12 active lines
become 5. No semantic change — verified parsing for plain paths,
:override, and :zip_override_<name>[N] forms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
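The single parameter expansion described above works because `${RECIPE%%:*}` is everything before the first `:`, and stripping that prefix from `RECIPE` leaves the `:`-led suffix (empty for plain paths). A small demonstration, where `split_recipe` is an illustrative wrapper rather than the repo's code:

```shell
# Split "path[:suffix]" in one expansion: the suffix is RECIPE with its
# before-the-first-colon prefix removed.
split_recipe() {
  local RECIPE="$1"
  local path="${RECIPE%%:*}"
  local suffix="${RECIPE#"${RECIPE%%:*}"}"
  echo "path=${path} suffix=${suffix}"
}

split_recipe "recipes/a.yaml"                      # → path=recipes/a.yaml suffix=
split_recipe "recipes/a.yaml:override"             # → path=recipes/a.yaml suffix=:override
split_recipe "recipes/a.yaml:zip_override_mtp[2]"  # → path=recipes/a.yaml suffix=:zip_override_mtp[2]
```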
…ression Upstream commit 52e697d (#108 "fix(nginx): raise file descriptor limit for nginx workers") prepends `ulimit -n 1048576 &&` to the nginx srun command. On clusters whose container inherits a sub-1M RLIMIT_NOFILE hard limit from slurmd/PAM, the bash builtin's setrlimit fails with EPERM (raising the hard rlimit needs CAP_SYS_RESOURCE in the init user namespace, which pyxis --container-remap-root does not grant). The `&&` short-circuits and nginx never starts — caught when re-running dsr1-fp4-gb200-dynamo-sglang. Pin back to 698590e ("feat(config): cluster-wide default_bash_preamble for ulimits and the like (#104)"), the immediately prior commit, where nginx runs without the chained ulimit. Bump forward once upstream softens the ulimit to `|| true` or makes it opt-in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the temporary rollback (698590e) with the upstream fix branch. 425b486 is the tip of NVIDIA/srt-slurm's `ishan-rework-nginx`, which makes the nginx ulimit + nginx.conf `worker_rlimit_nofile` directive opt-in via a new `frontend.nginx_raise_ulimit` field (default false). Without us opting in, nginx runs without the EPERM-prone bump that #108 introduced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…letions Upstream sa-bench used `--backend dynamo --endpoint /v1/completions`, but this repo's benchmark_serving.py doesn't have a `dynamo` backend choice (it has tgi/vllm/lmdeploy/deepspeed-mii/openai/openai-chat/tensorrt-llm/ scalellm/sglang). The dynamo frontend exposes a generic OpenAI-compatible API regardless of the underlying engine, so `openai` is the right canonical default. Recipes that need /v1/chat/completions can override via ENDPOINT. Also drop the unconditional `--endpoint /v1/completions` — bench_serving.py already defaults to that, and we now only pass --endpoint when ENDPOINT is non-empty (matches single-node bench scripts that don't pass it at all). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
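The conditional `--endpoint` pass-through described above can be sketched like this — `build_endpoint_args` is an illustrative helper, not the literal bench wrapper:

```shell
# Default to the generic openai backend; append --endpoint only when the
# ENDPOINT env var is non-empty, so bench_serving.py's own default applies
# otherwise.
build_endpoint_args() {
  local -a args=(--backend openai)
  if [ -n "${ENDPOINT:-}" ]; then
    args+=(--endpoint "$ENDPOINT")
  fi
  echo "${args[@]}"
}

unset ENDPOINT
build_endpoint_args                # → --backend openai
ENDPOINT="/v1/chat/completions"
build_endpoint_args                # → --backend openai --endpoint /v1/chat/completions
```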
Both fixes we wanted are now on origin/main: * #110 nginx-rework-ulimit (Ishan): gates the 1M nofile bump behind opt-in frontend.nginx_raise_ulimit. Default off, fixes clusters whose container RLIMIT_NOFILE hard cap < 1M. * #111 (cam): demotes the per-srun command logger.info to logger.debug so the 5KB fingerprint heredoc stops dominating orchestrator logs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that the proof-of-life recipe (dsr1-fp4-gb200-dynamo-sglang low-latency,
conc 4/8/32) ran clean end-to-end on a real cluster, sweep the rest of the
tree onto the new shape so all multi-node throughput sweeps drive
utils/bench_serving/benchmark_serving.py via benchmarks/multi_node/srt_bench.sh
instead of srt-slurm's bundled sa-bench client.
Each migrated recipe replaces:
benchmark:
type: "sa-bench"
isl: …
osl: …
concurrencies: …
req_rate: …
[use_chat_template: false]
with:
container_mounts:
"$INFMAX_WORKSPACE": "/infmax-workspace"
benchmark:
type: "custom"
command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
env:
[MODEL_NAME: "..."] # only when server's served-model-name diverges
# from the master-yaml `model:` value
PREFILL_GPUS: "..." # per prefill worker (filename component)
DECODE_GPUS: "..." # per decode worker (filename component)
TOTAL_GPUS: "..." # sum across all workers (filename component)
[USE_CHAT_TEMPLATE: "false"] # only carried over when set in original
GPU counts derived from each recipe's `resources:` block — uses
gpus_per_prefill / gpus_per_decode when set, else falls back to
nodes * gpus_per_node / workers. MODEL_NAME override added on the 59
sglang recipes whose backend.sglang_config.served-model-name is
"deepseek-ai/DeepSeek-R1" while master-yaml `model:` is the more specific
"deepseek-ai/DeepSeek-R1-0528" / "nvidia/DeepSeek-R1-0528-NVFP4-v2"
revision tag.
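The GPU-count derivation described above can be sketched as below — a hedged illustration where `derive_worker_gpus` is an assumed helper name, not the migration script itself:

```shell
# Per-worker GPU count: prefer the explicit gpus_per_prefill/gpus_per_decode
# value when set, else fall back to nodes * gpus_per_node / workers.
derive_worker_gpus() {
  local per_worker="$1" nodes="$2" gpus_per_node="$3" workers="$4"
  if [ -n "$per_worker" ]; then
    echo "$per_worker"
  else
    echo $(( nodes * gpus_per_node / workers ))
  fi
}

derive_worker_gpus "" 3 8 3   # → 8  (fallback: 3 nodes * 8 GPUs / 3 workers)
derive_worker_gpus 4 3 8 3    # → 4  (explicit per-worker value wins)
```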
Skipped:
- 3 sglang multi-override base files (1k1k.yaml / 8k1k.yaml under
dsr1/sglang/b200-fp{4,8}/) — their `benchmark:` lives nested under
`base:` and gets sa-bench-style overrides per `:override[N]` reference.
Migrating them needs a separate pass that handles the override-merge
semantics; their 26 master-yaml refs continue to dispatch via srt-slurm's
bundled sa-bench until then. Tracked as follow-up.
Validation: schema accepts all 81 master-yaml entries, 149/149 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	.github/configs/nvidia-master.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep8-dep16.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-7p1d-dep8-dep16.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-latency.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-middle-curve.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt-megamoe.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-mid-curve.yaml
#	benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml
#	benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-1p1d-dep8-tep8.yaml
#	benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-3p1d-dep8-dep16.yaml
#	runners/launch_gb200-nv.sh
> - No extra fields besides the ones listed may be specified, or else the benchmarks will fail to run.
> - Setting the fields above, particularly `ep` and `dp-attn`, only guarantees that the respective values will be passed as environment variables to the benchmark scripts! Actually using those environment variables is an implementation detail at the level of the benchmark Bash script.

> ## Multi-node srt-slurm recipes
🔴 All 12 trtllm/b300-fp8/*/disagg/mtp/*.yaml recipes declare model.precision: "fp4" on line 6 even though they live under b300-fp8/ with model.path: "dsr1-fp8" (FP8 weights). The 13 sibling STP recipes in the same b300-fp8 tree correctly declare "fp8", so this is a uniform copy-paste error from the b300-fp4 MTP templates. Fix: change precision: "fp4" → "fp8" on line 6 of every file under benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/{1k1k,8k1k}/disagg/mtp/.
Extended reasoning...
What the bug is
Every recipe under benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/*/disagg/mtp/ (12 files total — 6 in 1k1k/, 6 in 8k1k/) declares model.precision: "fp4" on line 6, while at the same time setting model.path: "dsr1-fp8" (FP8 weights) and being placed under the directory b300-fp8/. Their 13 sibling STP recipes in the same b300-fp8/ tree all correctly declare precision: "fp8", and the corresponding b300-fp4/ MTP recipes correctly declare "fp4". Pattern-wise, this is a uniform copy-paste error from the b300-fp4 MTP templates that was missed when adapting them for FP8.
How it manifests
grep -rn 'precision:' benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/ shows 12/12 MTP recipes with fp4 and 13/13 STP recipes with fp8. For example, b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml:
```yaml
model:
  path: "dsr1-fp8"          # FP8 weights
  container: "dynamo-trtllm"
  precision: "fp4"          # ← wrong, should be fp8
```

Step-by-step proof
- The recipe filename hashing in `benchmark-multinode-tmpl.yml` includes `${{ env.PRECISION }}` (`RESULT_FILENAME=…_${PRECISION}_${FRAMEWORK}_…`). `PRECISION` is sourced from the `precision` field of the recipe / master-yaml entry, per CONFIGS.md.
- For an entry that points at `trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml`, the recipe-side `precision` reads `"fp4"`, so the result filename and any tag forwarded to `srtctl` will say `fp4` even though the container actually loads `dsr1-fp8` weights.
- Every dashboard / aggregator that filters or groups by precision (which is the entire point of the field per CONFIGS.md) will then mix these FP8 b300 MTP runs into FP4 buckets.
Why existing code does not prevent it
The schema validator only verifies that the recipe file path resolves on disk; it does not cross-check that model.path (e.g. dsr1-fp8) is consistent with model.precision (fp4) or with the directory the recipe lives in. Nothing else in the diff (workflow templates, bench script) inspects precision for sanity — it is purely a metadata/tagging field.
Impact
Result mis-classification across an entire framework-precision-mode subtree (12 recipes). Anything that consumes the precision tag — dashboards, sweep selection, result aggregation — will treat these b300 FP8 MTP runs as FP4. Not a hard runtime breakage, but every result generated from these recipes is mislabeled, which is more than cosmetic for a benchmarking repo whose primary output is comparable numbers across precisions.
Fix
For each of the 12 affected files, change line 6 from precision: "fp4" to precision: "fp8". A single sed -i 's/precision: "fp4"/precision: "fp8"/' benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/{1k1k,8k1k}/disagg/mtp/*.yaml would do it. Files:
- `b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml`
- `b300-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch128_eplb0_mtp1_2560.yaml`
- `b300-fp8/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp2_720.yaml`
- `b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch16_eplb0_mtp3_160.yaml`
- `b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_10.yaml`
- `b300-fp8/1k1k/disagg/mtp/ctx3_gen2_dp8_batch512_eplb0_mtp1_11264.yaml`
- `b300-fp8/8k1k/disagg/mtp/ctx1_gen1_dp8_batch8_eplb0_mtp3_72.yaml`
- `b300-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch16_eplb0_mtp3_40.yaml`
- `b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch1_eplb0_mtp3_8.yaml`
- `b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch4_eplb0_mtp3_20.yaml`
- `b300-fp8/8k1k/disagg/mtp/ctx2_gen1_dp8_batch16_eplb0_mtp3_144.yaml`
- `b300-fp8/8k1k/disagg/mtp/ctx4_gen1_dp8_batch64_eplb0_mtp2_512.yaml`
(Comment placed on CONFIGS.md because the individual recipe files exceed the diff-comment threshold for line-anchored review comments.)
```shell
clone_and_install_srtctl() {
    local repo_url="https://github.com/NVIDIA/srt-slurm.git"
    # Pinned to NVIDIA/srt-slurm@main — currently 1372a10. Includes:
    # * #110 nginx-rework-ulimit: gates `ulimit -n 1048576` + worker_rlimit_nofile
    #   behind opt-in `frontend.nginx_raise_ulimit` (we don't opt in).
    # * #111 srun command line log demoted INFO -> DEBUG (5KB fingerprint
    #   heredoc no longer dominates orchestrator log).
    local ref="1372a10c493e3fd757f342d8516a5a91c30fe6ce"
    local repo_dir="${SRT_REPO_DIR:-srt-slurm}"
    local uv_install_dir="${UV_INSTALL_DIR:-${HOME}/.local/bin}"
    local uv_venv_dir="${UV_VENV_DIR:-.venv}"
```
🔴 The new clone_and_install_srtctl helper declares local repo_dir="${SRT_REPO_DIR:-srt-slurm}" (benchmark_lib.sh:926) and never exports the resolved value, but four launchers (launch_b200-dgxc.sh:52, launch_b300-nv.sh:53, launch_h100-dgxc-slurm.sh:67, launch_h200-dgxc-slurm.sh:60) still read $SRT_REPO_DIR after the helper returns to compute SRTCTL_ROOT="${GITHUB_WORKSPACE}/${SRT_REPO_DIR}". With SRT_REPO_DIR unset in the caller, that becomes ${GITHUB_WORKSPACE}/ (trailing slash, no subdir) and gets heredoc-baked into srtslurm.yaml as srtctl_root:, pointing srtctl at the workspace root rather than the cloned srt-slurm/. Easiest fix is export SRT_REPO_DIR="$repo_dir" inside the helper; alternatives are dropping local or hardcoding the literal in each launcher (gb200/gb300 already do).
Extended reasoning...
What the bug is
Commit 89bf3e3 (runners: factor srt-slurm clone+srtctl install into benchmark_lib helper) extracts the per-launcher SRT_REPO_DIR="srt-slurm" assignment and git clone block into a new clone_and_install_srtctl helper in benchmarks/benchmark_lib.sh. Inside the helper, the resolved value is captured as a function-local:
```shell
clone_and_install_srtctl() {
    ...
    local repo_dir="${SRT_REPO_DIR:-srt-slurm}"   # benchmark_lib.sh:926
    ...
}
```

`local` confines the binding to the function scope, and the helper never exports `SRT_REPO_DIR` back. Pre-89bf3e3, every launcher set `SRT_REPO_DIR="srt-slurm"` inline at the top before cloning — that assignment is what the launchers below were reading. A repo-wide grep for `SRT_REPO_DIR` shows it is set nowhere outside the helper, so every reader after the call sees an empty value.
The dangling readers
Four launchers still have the old ${SRT_REPO_DIR} read after switching to the helper:
- `runners/launch_b200-dgxc.sh:52` — `SRTCTL_ROOT="${GITHUB_WORKSPACE}/${SRT_REPO_DIR}"`
- `runners/launch_b300-nv.sh:53` — same
- `runners/launch_h100-dgxc-slurm.sh:67` — same
- `runners/launch_h200-dgxc-slurm.sh:60` — same
SRTCTL_ROOT is then heredoc-substituted into the generated srtslurm.yaml:
```yaml
# Path to srtctl repo root (where the configs live)
srtctl_root: "${SRTCTL_ROOT}"
```

The two gb-prefixed launchers (`launch_gb200-nv.sh:147`, `launch_gb300-nv.sh:48`) hardcode `${GITHUB_WORKSPACE}/srt-slurm` and so escape the `SRTCTL_ROOT` regression — they only have a cosmetic empty path in their `echo "Configs available at: $SRT_REPO_DIR/"` (gb200:144, gb300:45; also h100:47, h200:40).
Step-by-step proof for launch_b200-dgxc.sh
- Script enters the `IS_MULTINODE` branch. `SRT_REPO_DIR` is unset in the caller env.
- Line 36: `UV_INSTALL_DIR="..." UV_VENV_DIR="..." clone_and_install_srtctl || exit 1`. The helper does `local repo_dir="${SRT_REPO_DIR:-srt-slurm}"` → `repo_dir="srt-slurm"` inside the function only. The helper clones into `./srt-slurm`, `cd`s into it, sets up the venv, returns 0.
- Back in the caller, `SRT_REPO_DIR` is still unset (the `local` binding evaporated; nothing was exported).
- Line 52: `SRTCTL_ROOT="${GITHUB_WORKSPACE}/${SRT_REPO_DIR}"` expands to `SRTCTL_ROOT="/github/workspace/"` (or wherever `$GITHUB_WORKSPACE` points), with empty subdir and trailing slash.
- Lines 54–75: heredoc writes `srtslurm.yaml` with literal `srtctl_root: "/github/workspace/"`.
- `srtctl apply -f $CONFIG_FILE` reads `srtctl_root` from this YAML to locate the srt-slurm repo (configs/recipes). It now points at the workspace root, not `${GITHUB_WORKSPACE}/srt-slurm/`. cwd happens to be correct because the helper `cd`'d into the cloned repo, but the YAML pointer that downstream machinery uses is wrong.
Impact
Affects every multi-node benchmark run on b200-dgxc, b300-nv, h100-dgxc-slurm, and h200-dgxc-slurm runners — the four srt-slurm launchers that this PR refactored. gb200/gb300 are unaffected for SRTCTL_ROOT (cosmetic echo only).
Fix
Cleanest fix is in the helper itself — change local repo_dir=... to export SRT_REPO_DIR="${SRT_REPO_DIR:-srt-slurm}" (and use $SRT_REPO_DIR in the rest of the function). That preserves the helper API, restores the pre-PR contract callers depend on, and keeps the gb200/gb300 hardcoded paths working unchanged. Alternatives: have each affected launcher set SRT_REPO_DIR="srt-slurm" inline before calling the helper, or hardcode the literal in the four SRTCTL_ROOT= lines (consistent with gb200/gb300).
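The scoping pitfall behind the bug can be reproduced with toy helpers — a hedged illustration, not the actual launcher code: a `local` assignment dies with the function, so the caller's later expansion of `SRT_REPO_DIR` is empty, while the suggested `export` survives.

```shell
# local confines the binding to the function; export makes it visible to
# the caller (and, being exported, to child processes too).
helper_local()  { local  SRT_REPO_DIR="${SRT_REPO_DIR:-srt-slurm}"; }
helper_export() { export SRT_REPO_DIR="${SRT_REPO_DIR:-srt-slurm}"; }

unset SRT_REPO_DIR
helper_local
echo "after local:  SRTCTL_ROOT=/ws/${SRT_REPO_DIR}"   # → /ws/  (subdir lost)
helper_export
echo "after export: SRTCTL_ROOT=/ws/${SRT_REPO_DIR}"   # → /ws/srt-slurm
```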
```
@@ -0,0 +1,148 @@
name: b200-fp8-mtp-low-latency-tep8-1p-1d
```
🟡 The MTP recipe name fields in three files don't match their actual resource topology, looking like copy-paste artifacts from sa-bench templates that weren't updated when topologies changed. Specifically: 8k1k_mtp_lowlat_0.yaml says ...-1p-1d but is 1p-3d (TOTAL_GPUS=32); 8k1k_mtp_maxtpt_0.yaml says ...-1p-1d but is 1p-2d (TOTAL_GPUS=24); 8k1k_mtp_maxtpt_1.yaml says ...-1p-2d but is 1p-1d (TOTAL_GPUS=16). The peer STP files in the sibling stp/ directory are correctly named for the same topologies, so the fix is just to align the name field with the resources block — purely a labeling/observability bug since the runtime topology is driven by the resources block.
Extended reasoning...
What's wrong
Three new MTP recipe yamls in benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/ carry top-level name: fields that are inconsistent with the topology declared in the resources: block of the same file:
| File | `name` says | resources / TOTAL_GPUS imply | Mismatch |
|---|---|---|---|
| `8k1k_mtp_lowlat_0.yaml` | `b200-fp8-mtp-low-latency-tep8-1p-1d` | prefill=1, decode=3, TOTAL_GPUS=32 → 1p-3d | name claims 1d, resources have 3d |
| `8k1k_mtp_maxtpt_0.yaml` | `b200-fp8-mtp-max-tpt-dep8-1p-1d` | prefill=1, decode=2, TOTAL_GPUS=24 → 1p-2d | name claims 1d, resources have 2d |
| `8k1k_mtp_maxtpt_1.yaml` | `b200-fp8-mtp-max-tpt-dep8-1p-2d` | prefill=1, decode=1, TOTAL_GPUS=16 → 1p-1d | name claims 2d, resources have 1d |
Why the resources block is the source of truth
The peer STP recipes in the sibling stp/ directory (8k1k_stp_lowlat_0.yaml, 8k1k_stp_maxtpt_0.yaml, 8k1k_stp_maxtpt_1.yaml) carry correctly matched names for the identical topologies — ...-1p-3d, ...-1p-2d, ...-1p-1d respectively. This isolates the issue to the MTP side and rules out an alternative naming convention. Cross-checking against the parent 1k1k recipe (b200-fp8/1k1k/disagg/1k1k.yaml's zip_override_mtp_lowlat/zip_override_mtp_maxtpt sections) further confirms the convention: 1p-Nd always tracks the number of decode workers.
Step-by-step proof for 8k1k_mtp_lowlat_0.yaml
- Line 1: `name: b200-fp8-mtp-low-latency-tep8-1p-1d` — the name suffix claims 1 decode node.
- `resources.prefill_nodes: 1`, `resources.prefill_workers: 1` → 1 prefill worker × 8 GPUs/node = 8 prefill GPUs.
- `resources.decode_nodes: 3`, `resources.decode_workers: 3` → 3 decode workers × 8 GPUs/node = 24 decode GPUs.
- Total: 8 + 24 = 32 GPUs, matching `TOTAL_GPUS: "32"` in `benchmark.env`.
- The peer STP file at the same scale (`8k1k_stp_lowlat_0.yaml`) has identical `decode_nodes`/`decode_workers: 3` and `TOTAL_GPUS: "32"`, with name `b200-fp8-stp-low-latency-tp8-1p-3d` — clearly the right convention.
- Therefore the MTP file's `-1p-1d` suffix is a typo for `-1p-3d`. The same logic flips for `maxtpt_0` (should be `-1p-2d`) and `maxtpt_1` (should be `-1p-1d`) — the two maxtpt names look swapped.
Impact
The name field is used by srtctl to generate the SLURM job name and by downstream observability/result-mapping tooling (logs, dashboards). Benchmark execution itself is unaffected because the actual topology comes from resources.* and the result filename uses PREFILL_GPUS/DECODE_GPUS/TOTAL_GPUS, all of which are correct. The damage is scoped to humans interpreting SLURM/log output: runs labeled ...-1p-1d will silently appear as 1p-2d or 1p-3d in dashboards, complicating result attribution across the maxtpt scale sweep where the 0/1 names are also swapped relative to each other.
Fix
Update the three name: fields to match the resource topology:
- `8k1k_mtp_lowlat_0.yaml` line 1 → `name: b200-fp8-mtp-low-latency-tep8-1p-3d`
- `8k1k_mtp_maxtpt_0.yaml` line 1 → `name: b200-fp8-mtp-max-tpt-dep8-1p-2d`
- `8k1k_mtp_maxtpt_1.yaml` line 1 → `name: b200-fp8-mtp-max-tpt-dep8-1p-1d`
No other changes are needed; the resources, TOTAL_GPUS, and benchmark.env blocks are already self-consistent.
Summary
Pull every srt-slurm recipe we depend on into this repo and run multi-node throughput benchmarks against our own bench client instead of upstream srt-slurm's bundled
sa-bench.What changes
- Recipes from `NVIDIA/srt-slurm@sa-submission-q2-2026` vendored into `benchmarks/multi_node/srt-slurm-recipes/`. Layout: `<model>/<framework>/<hw>-<precision>/<isl><osl>/<agg|disagg>/<stp|mtp>/<recipe>.yaml`, so a sweep contributor can find every recipe in a given cell without grepping.
- First-class `recipe:` field on master-yaml search-space entries. Replaces the `prefill.additional-settings: ["CONFIG_FILE=recipes/..."]` env-var hack. Schema validates the file exists on disk.
- Migrated recipes use `benchmark.type: custom` + `command: bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh`, which is a thin loop on top of the existing `run_benchmark_serving()` shim that single-node scripts use. Removes the dual-maintain burden on `bench_serving.py`.
- srtctl pinned to `NVIDIA/srt-slurm@main` (set in `benchmark_lib.sh::clone_and_install_srtctl`). Drops the legacy `ishandhanani/srt-slurm@sa-submission-q1-2026` fallback for gb200 dsr1 sglang/trt.
- The `uv` install + image-name sanitization are factored into helpers (`clone_and_install_srtctl`, `sanitize_image_filename`) in `benchmarks/benchmark_lib.sh`.

How env reaches the bench container
- `benchmark-multinode-tmpl.yml` already exports `MODEL`/`ISL`/`OSL`/`CONC_LIST`/`DISAGG`/`RANDOM_RANGE_RATIO` at the workflow step. These propagate via `bash launcher.sh` → `srtctl apply` → `subprocess.Popen(env=None)` → `srun` (default `--export=ALL`) → pyxis → bench container. So recipes only carry per-recipe knobs in `benchmark.env`.
- `--tokenizer /model` is hardcoded in the wrapper because srt-slurm's `RuntimeContext.create` auto-mounts the model directory at `/model` inside every container. Avoids HF Hub egress and works for HF-id and alias-only `model:` values alike.

Validation
Out of scope (follow-ups)
- The 3 sglang multi-override base files (`dsr1/sglang/b200-fp{4,8}/.../1k1k.yaml`, `8k1k.yaml`) keep their `:override[N]` mechanism and stay on sa-bench. Their 26 master-yaml refs continue to dispatch via the bundled sa-bench until they're split into per-variant standalone files.
- … (`max_attempts: 720`, h100 `dist-timeout: 1800`, h200 health_check rewrite) are still in-place; can move into recipes proper as a separate cleanup.

Upstream srt-slurm contributions made along the way
- Made the nginx ulimit + `worker_rlimit_nofile` bump opt-in via the `frontend.nginx_raise_ulimit` field. Unblocks our cluster (RLIMIT_NOFILE hard cap < 1M). Merged.