
chore: upstream srt-slurm recipes + first-class recipe field + custom-bench wrapper #1211

Open
cquil11 wants to merge 17 commits into main from chore/upstream-srt-slurm

Conversation

Collaborator

cquil11 commented Apr 28, 2026

Summary

Pull every srt-slurm recipe we depend on into this repo and run multi-node throughput benchmarks against our own bench client instead of upstream srt-slurm's bundled sa-bench.

What changes

  • Recipes upstreamed. ~370 recipe yamls copied from NVIDIA/srt-slurm@sa-submission-q2-2026 into benchmarks/multi_node/srt-slurm-recipes/. Layout: <model>/<framework>/<hw>-<precision>/<isl><osl>/<agg|disagg>/<stp|mtp>/<recipe>.yaml so a sweep contributor can find every recipe in a given cell without grepping.
  • First-class recipe: field on master-yaml search-space entries. Replaces the prefill.additional-settings: ["CONFIG_FILE=recipes/..."] env-var hack. Schema validates the file exists on disk.
  • Multi-node bench client lives in this repo. Recipes use benchmark.type: custom + command: bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh, which is a thin loop on top of the existing run_benchmark_serving() shim that single-node scripts use. Removes the dual-maintain burden on bench_serving.py.
  • srt-slurm pin. All clones go to a single pinned commit on NVIDIA/srt-slurm@main (set in benchmark_lib.sh::clone_and_install_srtctl). Drops the legacy ishandhanani/srt-slurm@sa-submission-q1-2026 fallback for gb200 dsr1 sglang/trt.
  • Launcher refactor. Per-launcher copies of the clone+uv install + image-name sanitization are factored into helpers (clone_and_install_srtctl, sanitize_image_filename) in benchmarks/benchmark_lib.sh.

How env reaches the bench container

benchmark-multinode-tmpl.yml already exports MODEL/ISL/OSL/CONC_LIST/DISAGG/RANDOM_RANGE_RATIO at the workflow step. These propagate via bash launcher.sh → srtctl apply → subprocess.Popen(env=None) → srun (default --export=ALL) → pyxis → bench container. So recipes only carry per-recipe knobs in benchmark.env:

container_mounts:
  "$INFMAX_WORKSPACE": "/infmax-workspace"
benchmark:
  type: "custom"
  command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
  env:
    PREFILL_GPUS: "4"
    DECODE_GPUS: "8"
    TOTAL_GPUS: "20"
    # MODEL_NAME: "..."        # only when server's served-model-name diverges from $MODEL
    # USE_CHAT_TEMPLATE: "false" # only when overriding default
    # DSV4: "true"             # only when the recipe needs --dsv4 tokenizer

--tokenizer /model is hardcoded in the wrapper because srt-slurm's RuntimeContext.create auto-mounts the model directory at /model inside every container. Avoids HF Hub egress and works for HF-id and alias-only model: values alike.

Validation

  • ✅ Schema validates 82 master-yaml entries
  • ✅ 149/149 matrix-logic tests pass
  • ✅ End-to-end real-cluster run on B200 / sglang / dsr1-fp4 / 1k/1k @ conc=4×8×32 (run 25079718885)

Out of scope (follow-ups)

  • 3 sglang multi-override base files (dsr1/sglang/b200-fp{4,8}/.../1k1k.yaml, 8k1k.yaml) keep their :override[N] mechanism and stay on sa-bench. Their 26 master-yaml refs continue to dispatch via the bundled sa-bench until they're split into per-variant standalone files.
  • Cluster-tuning sed patches in launchers (max_attempts: 720, h100 dist-timeout: 1800, h200 health_check rewrite) are still in-place; can move into recipes proper as a separate cleanup.

Upstream srt-slurm contributions made along the way

  • NVIDIA/srt-slurm#111: demote per-srun command line from INFO to DEBUG (5KB fingerprint heredoc was dominating logs). Merged.
  • NVIDIA/srt-slurm#110 (Ishan): make nginx ulimit + worker_rlimit_nofile opt-in via frontend.nginx_raise_ulimit field. Unblocks our cluster (RLIMIT_NOFILE hard cap < 1M). Merged.

cquil11 and others added 3 commits April 28, 2026 09:50
Recipes referenced from NVIDIA/srt-slurm@sa-submission-q2-2026 are now tracked
under benchmarks/multi_node/srt-slurm-recipes/, mirroring the upstream
`recipes/` layout. The master-yaml plumbing for selecting one is hoisted out
of `prefill.additional-settings: ["CONFIG_FILE=recipes/..."]` into a
first-class `recipe:` field on the search-space entry, validated against
on-disk paths so unknown recipes fail fast at sweep generation. The benchmark
template resolves it to an absolute scratch-copy path passed to launchers as
CONFIG_FILE, so launcher behavior is unchanged otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six launchers each carried a ~22-line copy of the same git-clone, uv-install,
venv-create, srtctl-install sequence. Lift it into clone_and_install_srtctl()
in benchmarks/benchmark_lib.sh, parameterized by SRT_REPO_URL/SRT_BRANCH and
UV_INSTALL_DIR/UV_VENV_DIR env vars so each launcher can keep its workspace-
vs-NFS-vs-default-HOME placement decisions explicit at the call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lift the `echo "$IMAGE" | sed 's/[/:@#]/_/g'` slug used to name squash files
out of 13 launchers and into sanitize_image_filename() in benchmark_lib.sh.
Cluster-specific separator (h100/h200-dgxc-slurm use '+' instead of '_') is
expressed as the second arg, and the nvcr.io/-prefix-strip variant becomes
`sanitize_image_filename "${IMAGE#nvcr.io/}" +` rather than a sed pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
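The slug transform this commit describes can be sketched as a small helper. This is a hypothetical reconstruction from the description above, not the exact code in benchmark_lib.sh:

```shell
# Replace every /, :, @, # in an image reference with a separator so the
# result is safe to use as a squash filename. The second arg overrides
# the default '_' separator (h100/h200-dgxc-slurm pass '+').
sanitize_image_filename() {
  local image="$1" sep="${2:-_}"
  printf '%s\n' "$image" | sed "s|[/:@#]|${sep}|g"
}
```

With this shape, the nvcr.io/-prefix-strip variant is just `sanitize_image_filename "${IMAGE#nvcr.io/}" +` as described above.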
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

cquil11 and others added 13 commits April 28, 2026 10:54
Restructure benchmarks/multi_node/srt-slurm-recipes/ from the upstream's
heterogeneous layout into a uniform tree:

  <model>/<framework>/<hw>-<precision>/<isl>/<osl>/<agg|disagg>/<stp|mtp>/<recipe>.yaml

so a sweep contributor can navigate to dsr1/trtllm/b200-fp4/1k/1k/disagg/mtp/
and immediately see every recipe that fits that cell. The 3 sglang
multi-override files that span both stp and mtp are parked one level
shallower (no trailing stp|mtp/), since the override section selects the speculative-decoding mode.

365 files moved, 388 active + 5 commented recipe references rewritten,
schema validation + tests still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per request, drop the awkward `1k/1k/` two-segment intermediate in the
recipe tree in favor of `1k1k/`. New shape:

  <model>/<framework>/<hw>-<precision>/<isl><osl>/<agg|disagg>/<stp|mtp>/<recipe>.yaml

370 files renamed, 393 recipe references in nvidia-master.yaml rewritten,
schema validation + tests still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the ishandhanani/srt-slurm@sa-submission-q1-2026 fallback in
launch_gb200-nv.sh — every launcher now clones NVIDIA/srt-slurm at the
pinned commit 52e697d (nginx fd-limit fix on origin/main, Apr 2026).
Pinning to a SHA instead of a moving branch keeps benchmark runs
reproducible across upstream churn.

Rename the helper's SRT_BRANCH env var to SRT_REF for accuracy
(it accepts any git ref, not just a branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop env-var override for SRT_REPO_URL / SRT_REF — every benchmark run
must use the same pinned srtctl, no ad-hoc overrides at the call site.
Bumping the pin is now a one-line edit to benchmark_lib.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…of-life)

Stop relying on srt-slurm's bundled `benchmark.type: sa-bench` (which ships
its own copy of bench_serving.py inside the upstream repo) and instead use
`benchmark.type: custom` to run *this* repo's utils/bench_serving against the
already-ready frontend. Avoids dual-maintaining the bench client.

Plumbing:
- benchmarks/multi_node/srt_bench.sh: thin wrapper that mirrors sa-bench's
  per-conc warmup-then-bench loop, writes results to the same
  /logs/sa-bench_isl_<ISL>_osl_<OSL>/results_concurrency_<N>_gpus_<TOT>_ctx_<P>_gen_<D>.json
  layout the launcher result-harvesters already grep, with conc list parsed
  from x-separated env (e.g. "128x256x1024").
- Recipe shape: add `container_mounts: { $INFMAX_WORKSPACE: /infmax-workspace }`
  + replace `benchmark: { type: sa-bench, ... }` with
  `benchmark: { type: custom, command: "bash /infmax-workspace/...", env: {...} }`.
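The x-separated concurrency parsing mentioned above can be sketched as follows (hypothetical helper name; the real wrapper may structure this differently):

```shell
# Turn "128x256x1024" into one value per line so the caller can loop
# over concurrencies with a plain for-loop or while read.
parse_conc_list() {
  printf '%s' "$1" | tr 'x' '\n'
}
```

Usage in a warmup-then-bench loop would look like `for conc in $(parse_conc_list "$CONC_LIST"); do … ; done`.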

Migrated as proof-of-life:
- dsr1/trtllm/b200-fp4/1k1k mtp ctx1_gen2_dep8_batch64_eplb0_mtp2  (TRT-LLM)
- dsr1/sglang/gb200-fp4/1k1k stp low-latency                       (SGLang)
- dsv4/vllm/gb200-fp4/1k1k stp disagg-gb200-1p1d-dep8-tep8          (vLLM)

Remaining ~360 recipes still use sa-bench; they migrate in a follow-up once
this triplet runs end-to-end on a real cluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two pieces, one commit:

1. benchmark_lib.sh's run_benchmark_serving() gains optional pass-throughs
   for --tokenizer / --endpoint / --dataset-name / --dataset-path so the
   multi-node srt_bench.sh wrapper can reuse it instead of forking its own
   command-build. (--request-rate stays hardcoded "inf" — no recipe-level
   override.) ~50 lines of duplicated shell deleted from srt_bench.sh.

2. Recipe `benchmark.env` blocks lose every variable that is already
   exported by .github/workflows/benchmark-multinode-tmpl.yml at the
   workflow step level: MODEL, ISL, OSL, CONC_LIST, DISAGG, RANDOM_RANGE_RATIO.
   Those propagate down through srtctl → srun (default --export=ALL) → pyxis
   into the bench container, so srt_bench.sh reads them directly. Recipes
   now only carry per-recipe topology knobs (PREFILL_GPUS / DECODE_GPUS /
   TOTAL_GPUS — used in the result filename) plus the rare overrides.
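A minimal sketch of how a wrapper like srt_bench.sh can fail fast on workflow-exported variables while leaving per-recipe knobs optional (the helper name and behavior are assumed, not taken from the actual script):

```shell
# Required vars arrive from the workflow step via srun --export=ALL;
# a missing one aborts before any benchmark traffic is sent.
require_env() {
  eval "val=\"\${$1:-}\""
  if [ -z "$val" ]; then
    echo "missing required env: $1" >&2
    return 1
  fi
  printf '%s\n' "$val"
}
```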

Tokenizer is hardcoded to /model — srtctl's RuntimeContext.create
unconditionally bind-mounts the local model dir at that path in every
container, so AutoTokenizer.from_pretrained("/model") loads from the same
files the engine is serving. No HF Hub egress, works for HF-id and alias-
only `model:` values alike, no `TOKENIZER_PATH` knob in recipes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er/--endpoint

Walked back the --dataset-name / --dataset-path additions to
run_benchmark_serving — both default cleanly (random / unset) for every
multi-node throughput sweep we run, so the pass-throughs were dead
weight. srt_bench.sh stops setting DATASET_NAME / DATASET_PATH from env.

Kept --tokenizer (srt_bench points it at /model since --model is the
served-model-name alias, not a HF id) and --endpoint (recipes may need
/v1/chat/completions for chat-template-only request paths).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same behavior, fewer lines: collapse the two-step suffix split into a
single ${RECIPE#"${RECIPE%%:*}"} parameter expansion. 12 active lines
become 5. No semantic change — verified parsing for plain paths,
:override, and :zip_override_<name>[N] forms.
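The collapsed expansion works because ${RECIPE%%:*} is everything before the first colon, so stripping it as a prefix leaves exactly the colon-suffix. A standalone sketch (hypothetical function name):

```shell
# Returns ":override[2]"-style suffixes, or "" for plain recipe paths.
recipe_suffix() {
  local RECIPE="$1"
  printf '%s\n' "${RECIPE#"${RECIPE%%:*}"}"
}
```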

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ression

Upstream commit 52e697d (#108 "fix(nginx): raise file descriptor limit for
nginx workers") prepends `ulimit -n 1048576 &&` to the nginx srun command.
On clusters whose container inherits a sub-1M RLIMIT_NOFILE hard limit
from slurmd/PAM, the bash builtin's setrlimit fails with EPERM (raising
the hard rlimit needs CAP_SYS_RESOURCE in the init user namespace, which
pyxis --container-remap-root does not grant). The `&&` short-circuits and
nginx never starts — caught when re-running dsr1-fp4-gb200-dynamo-sglang.

Pin back to 698590e ("feat(config): cluster-wide default_bash_preamble for
ulimits and the like (#104)"), the immediately prior commit, where nginx
runs without the chained ulimit. Bump forward once upstream softens the
ulimit to `|| true` or makes it opt-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the temporary rollback (698590e) with the upstream fix branch.
425b486 is the tip of NVIDIA/srt-slurm's `ishan-rework-nginx`, which makes
the nginx ulimit + nginx.conf `worker_rlimit_nofile` directive opt-in via
a new `frontend.nginx_raise_ulimit` field (default false). Without us
opting in, nginx runs without the EPERM-prone bump that #108 introduced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…letions

Upstream sa-bench used `--backend dynamo --endpoint /v1/completions`, but
this repo's benchmark_serving.py doesn't have a `dynamo` backend choice
(it has tgi/vllm/lmdeploy/deepspeed-mii/openai/openai-chat/tensorrt-llm/
scalellm/sglang). The dynamo frontend exposes a generic OpenAI-compatible
API regardless of the underlying engine, so `openai` is the right canonical
default. Recipes that need /v1/chat/completions can override via ENDPOINT.

Also drop the unconditional `--endpoint /v1/completions` — bench_serving.py
already defaults to that, and we now only pass --endpoint when ENDPOINT is
non-empty (matches single-node bench scripts that don't pass it at all).
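The only-when-non-empty flag handling can be sketched like this (hypothetical helper; the real wrapper builds the whole command line):

```shell
# Emit "--endpoint <path>" only when ENDPOINT is set and non-empty,
# so benchmark_serving.py keeps its own /v1/completions default.
endpoint_args() {
  if [ -n "${ENDPOINT:-}" ]; then
    printf -- '--endpoint %s\n' "$ENDPOINT"
  fi
}
```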

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both fixes we wanted are now on origin/main:
  * #110 nginx-rework-ulimit (Ishan): gates the 1M nofile bump behind opt-in
    frontend.nginx_raise_ulimit. Default off, fixes clusters whose container
    RLIMIT_NOFILE hard cap < 1M.
  * #111 (cam): demotes the per-srun command logger.info to logger.debug so
    the 5KB fingerprint heredoc stops dominating orchestrator logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that the proof-of-life recipe (dsr1-fp4-gb200-dynamo-sglang low-latency,
conc 4/8/32) ran clean end-to-end on a real cluster, sweep the rest of the
tree onto the new shape so all multi-node throughput sweeps drive
utils/bench_serving/benchmark_serving.py via benchmarks/multi_node/srt_bench.sh
instead of srt-slurm's bundled sa-bench client.

Each migrated recipe replaces:

  benchmark:
    type: "sa-bench"
    isl: …
    osl: …
    concurrencies: …
    req_rate: …
    [use_chat_template: false]

with:

  container_mounts:
    "$INFMAX_WORKSPACE": "/infmax-workspace"

  benchmark:
    type: "custom"
    command: "bash /infmax-workspace/benchmarks/multi_node/srt_bench.sh"
    env:
      [MODEL_NAME: "..."]      # only when server's served-model-name diverges
                               # from the master-yaml `model:` value
      PREFILL_GPUS: "..."      # per prefill worker (filename component)
      DECODE_GPUS: "..."       # per decode worker (filename component)
      TOTAL_GPUS: "..."        # sum across all workers (filename component)
      [USE_CHAT_TEMPLATE: "false"]   # only carried over when set in original

GPU counts derived from each recipe's `resources:` block — uses
gpus_per_prefill / gpus_per_decode when set, else falls back to
nodes * gpus_per_node / workers. MODEL_NAME override added on the 59
sglang recipes whose backend.sglang_config.served-model-name is
"deepseek-ai/DeepSeek-R1" while master-yaml `model:` is the more specific
"deepseek-ai/DeepSeek-R1-0528" / "nvidia/DeepSeek-R1-0528-NVFP4-v2"
revision tag.
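The fallback arithmetic for the GPU counts can be sketched as follows (hypothetical helper mirroring the description; the real derivation was a one-time migration step, not runtime code):

```shell
# Prefer an explicit gpus_per_prefill/gpus_per_decode value when set;
# otherwise derive it from nodes * gpus_per_node / workers.
derive_gpus_per_worker() {
  local explicit="$1" nodes="$2" gpus_per_node="$3" workers="$4"
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
  else
    printf '%s\n' "$(( nodes * gpus_per_node / workers ))"
  fi
}
```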

Skipped:
  - 3 sglang multi-override base files (1k1k.yaml / 8k1k.yaml under
    dsr1/sglang/b200-fp{4,8}/) — their `benchmark:` lives nested under
    `base:` and gets sa-bench-style overrides per `:override[N]` reference.
    Migrating them needs a separate pass that handles the override-merge
    semantics; their 26 master-yaml refs continue to dispatch via srt-slurm's
    bundled sa-bench until then. Tracked as follow-up.

Validation: schema accepts all 81 master-yaml entries, 149/149 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cquil11 cquil11 marked this pull request as ready for review April 29, 2026 17:20
@cquil11 cquil11 requested a review from a team April 29, 2026 17:20
# Conflicts:
#	.github/configs/nvidia-master.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep8-dep16.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/1k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-1p1d-dep8-tep8.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-2p1d-dep8-dep8-c4096-offload.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16-c4096-offload.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-3p1d-dep8-dep16.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-7p1d-dep8-dep16.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-latency.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-low-middle-curve.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt-megamoe.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-max-tpt.yaml
#	benchmarks/multi_node/srt-slurm-recipes/dsv4/vllm/gb200-fp4/8k1k/disagg/stp/disagg-gb200-mid-curve.yaml
#	benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml
#	benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-1p1d-dep8-tep8.yaml
#	benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/8k1k/disagg-gb200-3p1d-dep8-dep16.yaml
#	runners/launch_gb200-nv.sh
cquil11 changed the title from "Chore/upstream srt slurm" to "chore: upstream srt-slurm recipes + first-class recipe field + custom-bench wrapper" on Apr 29, 2026
- No extra fields besides the ones listed may be specified, or else the benchmarks will fail to run.
- Setting the fields above, particularly `ep` and `dp-attn`, only guarantees that the respective values will be passed as environment variables to the benchmark scripts! Actually using those environment variables is an implementation detail at the level of the benchmark Bash script.

## Multi-node srt-slurm recipes
Contributor

🔴 All 12 trtllm/b300-fp8/*/disagg/mtp/*.yaml recipes declare model.precision: "fp4" on line 6 even though they live under b300-fp8/ with model.path: "dsr1-fp8" (FP8 weights). The 13 sibling STP recipes in the same b300-fp8 tree correctly declare "fp8", so this is a uniform copy-paste error from the b300-fp4 MTP templates. Fix: change precision: "fp4" → "fp8" on line 6 of every file under benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/{1k1k,8k1k}/disagg/mtp/.

Extended reasoning...

What the bug is

Every recipe under benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/*/disagg/mtp/ (12 files total — 6 in 1k1k/, 6 in 8k1k/) declares model.precision: "fp4" on line 6, while at the same time setting model.path: "dsr1-fp8" (FP8 weights) and being placed under the directory b300-fp8/. Their 13 sibling STP recipes in the same b300-fp8/ tree all correctly declare precision: "fp8", and the corresponding b300-fp4/ MTP recipes correctly declare "fp4". Pattern-wise, this is a uniform copy-paste error from the b300-fp4 MTP templates that was missed when adapting them for FP8.

How it manifests

grep -rn 'precision:' benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/ shows 12/12 MTP recipes with fp4 and 13/13 STP recipes with fp8. For example, b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml:

model:
  path: "dsr1-fp8"          # FP8 weights
  container: "dynamo-trtllm"
  precision: "fp4"          # ← wrong, should be fp8

Step-by-step proof

  1. The recipe filename hashing in benchmark-multinode-tmpl.yml includes ${{ env.PRECISION }} (RESULT_FILENAME=…_${PRECISION}_${FRAMEWORK}_…).
  2. PRECISION is sourced from the precision field of the recipe / master-yaml entry, per CONFIGS.md.
  3. For an entry that points at trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml, the recipe-side precision reads "fp4", so the result filename and any tag forwarded to srtctl will say fp4 even though the container actually loads dsr1-fp8 weights.
  4. Every dashboard / aggregator that filters or groups by precision (which is the entire point of the field per CONFIGS.md) will then mix these FP8 b300 MTP runs into FP4 buckets.

Why existing code does not prevent it

The schema validator only verifies that the recipe file path resolves on disk; it does not cross-check that model.path (e.g. dsr1-fp8) is consistent with model.precision (fp4) or with the directory the recipe lives in. Nothing else in the diff (workflow templates, bench script) inspects precision for sanity — it is purely a metadata/tagging field.
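A cross-check along the lines the reviewer suggests could compare the declared precision against the -fpN suffix of the hardware directory segment. The sketch below assumes the `<model>/<framework>/<hw>-<precision>/…` layout this PR introduces and is not part of the actual validator:

```shell
# Succeeds when the declared precision matches the fpN suffix of the
# <hw>-<precision> path segment, e.g. b300-fp8 => fp8.
check_precision_matches_dir() {
  local path="$1" declared="$2" seg dir_prec="" IFS=/
  for seg in $path; do
    case "$seg" in
      *-fp*) dir_prec="${seg##*-}" ;;
    esac
  done
  [ "$dir_prec" = "$declared" ]
}
```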

Impact

Result mis-classification across an entire framework-precision-mode subtree (12 recipes). Anything that consumes the precision tag — dashboards, sweep selection, result aggregation — will treat these b300 FP8 MTP runs as FP4. Not a hard runtime breakage, but every result generated from these recipes is mislabeled, which is more than cosmetic for a benchmarking repo whose primary output is comparable numbers across precisions.

Fix

For each of the 12 affected files, change line 6 from precision: "fp4" to precision: "fp8". A single sed -i 's/precision: "fp4"/precision: "fp8"/' benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/{1k1k,8k1k}/disagg/mtp/*.yaml would do it. Files:

  • b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml
  • b300-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch128_eplb0_mtp1_2560.yaml
  • b300-fp8/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp2_720.yaml
  • b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch16_eplb0_mtp3_160.yaml
  • b300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_10.yaml
  • b300-fp8/1k1k/disagg/mtp/ctx3_gen2_dp8_batch512_eplb0_mtp1_11264.yaml
  • b300-fp8/8k1k/disagg/mtp/ctx1_gen1_dp8_batch8_eplb0_mtp3_72.yaml
  • b300-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch16_eplb0_mtp3_40.yaml
  • b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch1_eplb0_mtp3_8.yaml
  • b300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch4_eplb0_mtp3_20.yaml
  • b300-fp8/8k1k/disagg/mtp/ctx2_gen1_dp8_batch16_eplb0_mtp3_144.yaml
  • b300-fp8/8k1k/disagg/mtp/ctx4_gen1_dp8_batch64_eplb0_mtp2_512.yaml

(Comment placed on CONFIGS.md because the individual recipe files exceed the diff-comment threshold for line-anchored review comments.)

Comment on lines +918 to +928
clone_and_install_srtctl() {
local repo_url="https://github.com/NVIDIA/srt-slurm.git"
# Pinned to NVIDIA/srt-slurm@main — currently 1372a10. Includes:
# * #110 nginx-rework-ulimit: gates `ulimit -n 1048576` + worker_rlimit_nofile
# behind opt-in `frontend.nginx_raise_ulimit` (we don't opt in).
# * #111 srun command line log demoted INFO -> DEBUG (5KB fingerprint
# heredoc no longer dominates orchestrator log).
local ref="1372a10c493e3fd757f342d8516a5a91c30fe6ce"
local repo_dir="${SRT_REPO_DIR:-srt-slurm}"
local uv_install_dir="${UV_INSTALL_DIR:-${HOME}/.local/bin}"
local uv_venv_dir="${UV_VENV_DIR:-.venv}"
Contributor

🔴 The new clone_and_install_srtctl helper declares local repo_dir="${SRT_REPO_DIR:-srt-slurm}" (benchmark_lib.sh:926) and never exports the resolved value, but four launchers (launch_b200-dgxc.sh:52, launch_b300-nv.sh:53, launch_h100-dgxc-slurm.sh:67, launch_h200-dgxc-slurm.sh:60) still read $SRT_REPO_DIR after the helper returns to compute SRTCTL_ROOT="${GITHUB_WORKSPACE}/${SRT_REPO_DIR}". With SRT_REPO_DIR unset in the caller, that becomes ${GITHUB_WORKSPACE}/ (trailing slash, no subdir) and gets heredoc-baked into srtslurm.yaml as srtctl_root:, pointing srtctl at the workspace root rather than the cloned srt-slurm/. Easiest fix is export SRT_REPO_DIR="$repo_dir" inside the helper; alternatives are dropping local or hardcoding the literal in each launcher (gb200/gb300 already do).

Extended reasoning...

What the bug is

Commit 89bf3e3 (runners: factor srt-slurm clone+srtctl install into benchmark_lib helper) extracts the per-launcher SRT_REPO_DIR="srt-slurm" assignment and git clone block into a new clone_and_install_srtctl helper in benchmarks/benchmark_lib.sh. Inside the helper, the resolved value is captured as a function-local:

clone_and_install_srtctl() {
    ...
    local repo_dir="${SRT_REPO_DIR:-srt-slurm}"   # benchmark_lib.sh:926
    ...
}

local confines the binding to the function scope, and the helper never exports SRT_REPO_DIR back. Pre-89bf3e3, every launcher set SRT_REPO_DIR="srt-slurm" inline at the top before cloning — that assignment is what the launchers below were reading. A repo-wide grep for SRT_REPO_DIR shows it is set nowhere outside the helper, so every reader after the call sees an empty value.

The dangling readers

Four launchers still have the old ${SRT_REPO_DIR} read after switching to the helper:

  • runners/launch_b200-dgxc.sh:52SRTCTL_ROOT="${GITHUB_WORKSPACE}/${SRT_REPO_DIR}"
  • runners/launch_b300-nv.sh:53 — same
  • runners/launch_h100-dgxc-slurm.sh:67 — same
  • runners/launch_h200-dgxc-slurm.sh:60 — same

SRTCTL_ROOT is then heredoc-substituted into the generated srtslurm.yaml:

# Path to srtctl repo root (where the configs live)
srtctl_root: "${SRTCTL_ROOT}"

The two gb-prefixed launchers (launch_gb200-nv.sh:147, launch_gb300-nv.sh:48) hardcode ${GITHUB_WORKSPACE}/srt-slurm and so escape the SRTCTL_ROOT regression — they only have a cosmetic empty-path in their echo "Configs available at: $SRT_REPO_DIR/" (gb200:144, gb300:45, also h100:47, h200:40).

Step-by-step proof for launch_b200-dgxc.sh

  1. Script enters the IS_MULTINODE branch. SRT_REPO_DIR is unset in the caller env.
  2. Line 36: UV_INSTALL_DIR="..." UV_VENV_DIR="..." clone_and_install_srtctl || exit 1. The helper does local repo_dir="${SRT_REPO_DIR:-srt-slurm}" → repo_dir="srt-slurm" inside the function only. Helper clones into ./srt-slurm, cd's into it, sets up the venv. Returns 0.
  3. Back in the caller, SRT_REPO_DIR is still unset (the local binding evaporated; nothing was exported).
  4. Line 52: SRTCTL_ROOT="${GITHUB_WORKSPACE}/${SRT_REPO_DIR}" expands to SRTCTL_ROOT="/github/workspace/" (or wherever $GITHUB_WORKSPACE points), with empty subdir and trailing slash.
  5. Lines 54–75: heredoc writes srtslurm.yaml with literal srtctl_root: "/github/workspace/".
  6. srtctl apply -f $CONFIG_FILE reads srtctl_root from this YAML to locate the srt-slurm repo (configs/recipes). It now points at the workspace root, not ${GITHUB_WORKSPACE}/srt-slurm/. cwd happens to be correct because the helper cd'd into the cloned repo, but the YAML pointer that downstream machinery uses is wrong.

Impact

Affects every multi-node benchmark run on b200-dgxc, b300-nv, h100-dgxc-slurm, and h200-dgxc-slurm runners — the four srt-slurm launchers that this PR refactored. gb200/gb300 are unaffected for SRTCTL_ROOT (cosmetic echo only).

Fix

Cleanest fix is in the helper itself — change local repo_dir=... to export SRT_REPO_DIR="${SRT_REPO_DIR:-srt-slurm}" (and use $SRT_REPO_DIR in the rest of the function). That preserves the helper API, restores the pre-PR contract callers depend on, and keeps the gb200/gb300 hardcoded paths working unchanged. Alternatives: have each affected launcher set SRT_REPO_DIR="srt-slurm" inline before calling the helper, or hardcode the literal in the four SRTCTL_ROOT= lines (consistent with gb200/gb300).
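The scoping difference at the heart of this bug is easy to demonstrate in isolation (toy functions, not the actual helper):

```shell
# `local` confines the assignment to the function body; nothing reaches
# the caller. `export` (or a plain global assignment) survives the return.
helper_with_local() {
  local SRT_REPO_DIR="srt-slurm"
}
helper_with_export() {
  export SRT_REPO_DIR="${SRT_REPO_DIR:-srt-slurm}"
}
```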

@@ -0,0 +1,148 @@
name: b200-fp8-mtp-low-latency-tep8-1p-1d
Contributor

🟡 The MTP recipe name fields in three files don't match their actual resource topology, looking like copy-paste artifacts from sa-bench templates that weren't updated when topologies changed. Specifically: 8k1k_mtp_lowlat_0.yaml says ...-1p-1d but is 1p-3d (TOTAL_GPUS=32); 8k1k_mtp_maxtpt_0.yaml says ...-1p-1d but is 1p-2d (TOTAL_GPUS=24); 8k1k_mtp_maxtpt_1.yaml says ...-1p-2d but is 1p-1d (TOTAL_GPUS=16). The peer STP files in the sibling stp/ directory are correctly named for the same topologies, so the fix is just to align the name field with the resources block — purely a labeling/observability bug since the runtime topology is driven by the resources block.

Extended reasoning...

What's wrong

Three new MTP recipe yamls in benchmarks/multi_node/srt-slurm-recipes/dsr1/sglang/b200-fp8/8k1k/disagg/mtp/ carry top-level name: fields that are inconsistent with the topology declared in the resources: block of the same file:

| File | name says | resources/TOTAL_GPUS imply | Mismatch |
| --- | --- | --- | --- |
| 8k1k_mtp_lowlat_0.yaml | b200-fp8-mtp-low-latency-tep8-1p-1d | prefill=1, decode=3, TOTAL_GPUS=32 → 1p-3d | name claims 1d, resources have 3d |
| 8k1k_mtp_maxtpt_0.yaml | b200-fp8-mtp-max-tpt-dep8-1p-1d | prefill=1, decode=2, TOTAL_GPUS=24 → 1p-2d | name claims 1d, resources have 2d |
| 8k1k_mtp_maxtpt_1.yaml | b200-fp8-mtp-max-tpt-dep8-1p-2d | prefill=1, decode=1, TOTAL_GPUS=16 → 1p-1d | name claims 2d, resources have 1d |

Why the resources block is the source of truth

The peer STP recipes in the sibling stp/ directory (8k1k_stp_lowlat_0.yaml, 8k1k_stp_maxtpt_0.yaml, 8k1k_stp_maxtpt_1.yaml) carry correctly matched names for the identical topologies — ...-1p-3d, ...-1p-2d, ...-1p-1d respectively. This isolates the issue to the MTP side and rules out an alternative naming convention. Cross-checking against the parent 1k1k recipe (b200-fp8/1k1k/disagg/1k1k.yaml's zip_override_mtp_lowlat/zip_override_mtp_maxtpt sections) further confirms the convention: 1p-Nd always tracks the number of decode workers.

Step-by-step proof for 8k1k_mtp_lowlat_0.yaml

  1. Line 1: name: b200-fp8-mtp-low-latency-tep8-1p-1d — name suffix claims 1 decode node.
  2. resources.prefill_nodes: 1, resources.prefill_workers: 1 → 1 prefill worker × 8 GPUs/node = 8 prefill GPUs.
  3. resources.decode_nodes: 3, resources.decode_workers: 3 → 3 decode workers × 8 GPUs/node = 24 decode GPUs.
  4. Total: 8 + 24 = 32 GPUs, matching TOTAL_GPUS: "32" in benchmark.env.
  5. The peer STP file at the same scale (8k1k_stp_lowlat_0.yaml) has identical decode_nodes/decode_workers: 3 and TOTAL_GPUS: "32", with name b200-fp8-stp-low-latency-tp8-1p-3d — clearly the right convention.
  6. Therefore the MTP file's -1p-1d suffix is a typo for -1p-3d. The same logic flips for maxtpt_0 (should be -1p-2d) and maxtpt_1 (should be -1p-1d) — the two maxtpt names look swapped.

Impact

The name field is used by srtctl to generate the SLURM job name and by downstream observability/result-mapping tooling (logs, dashboards). Benchmark execution itself is unaffected because the actual topology comes from resources.* and the result filename uses PREFILL_GPUS/DECODE_GPUS/TOTAL_GPUS, all of which are correct. The damage is scoped to humans interpreting SLURM/log output: runs labeled ...-1p-1d will silently appear as 1p-2d or 1p-3d in dashboards, complicating result attribution across the maxtpt scale sweep where the 0/1 names are also swapped relative to each other.

Fix

Update the three name: fields to match the resource topology:

  • 8k1k_mtp_lowlat_0.yaml line 1 → name: b200-fp8-mtp-low-latency-tep8-1p-3d
  • 8k1k_mtp_maxtpt_0.yaml line 1 → name: b200-fp8-mtp-max-tpt-dep8-1p-2d
  • 8k1k_mtp_maxtpt_1.yaml line 1 → name: b200-fp8-mtp-max-tpt-dep8-1p-1d

No other changes are needed; the resources, TOTAL_GPUS, and benchmark.env blocks are already self-consistent.
