-
Notifications
You must be signed in to change notification settings - Fork 155
chore: upstream srt-slurm recipes + first-class recipe field + custom-bench wrapper #1211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
cquil11
wants to merge
17
commits into
main
Choose a base branch
from
chore/upstream-srt-slurm
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
17 commits
Select commit
Hold shift + click to select a range
80d9808
srt-slurm: upstream recipes and add first-class `recipe:` field
cquil11 89bf3e3
runners: factor srt-slurm clone+srtctl install into benchmark_lib helper
cquil11 d29d066
runners: factor image-filename sanitization into benchmark_lib helper
cquil11 77de857
srt-slurm: reorganize recipes by model/framework/hw/seq-len/topology
cquil11 aa430b5
srt-slurm: collapse split <isl>/<osl>/ recipe dirs into <isl><osl>/
cquil11 1ca0696
runners: pin all srt-slurm clones to NVIDIA/srt-slurm@52e697d5
cquil11 0f755d2
runners: hardcode srt-slurm pin in benchmark_lib helper
cquil11 6f99d48
srt-slurm: wire custom-script bench, drop sa-bench dependency (proof-…
cquil11 290fcb6
srt-slurm: simplify custom-bench plumbing — drop redundant recipe env
cquil11 adf8a11
srt-slurm: keep run_benchmark_serving pass-throughs to just --tokeniz…
cquil11 baf8e28
srt-slurm: compress recipe-resolution block in benchmark template
cquil11 d3e9b93
runners: roll srt-slurm pin back one commit to dodge nginx ulimit reg…
cquil11 1241086
runners: bump srt-slurm pin to ishan-rework-nginx (425b486)
cquil11 fecd2de
srt-slurm: default bench backend to `openai`, drop hardcoded /v1/comp…
cquil11 24d118f
runners: bump srt-slurm pin to NVIDIA/main@1372a10
cquil11 792d8aa
srt-slurm: migrate remaining 364 recipes from sa-bench → custom
cquil11 d7dc72f
Merge remote-tracking branch 'origin/main' into chore/upstream-srt-slurm
cquil11 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
The table of contents is too big for display.
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔴 All 12
trtllm/b300-fp8/*/disagg/mtp/*.yamlrecipes declaremodel.precision: "fp4"on line 6 even though they live underb300-fp8/withmodel.path: "dsr1-fp8"(FP8 weights). The 13 sibling STP recipes in the same b300-fp8 tree correctly declare"fp8", so this is a uniform copy-paste error from the b300-fp4 MTP templates. Fix: changeprecision: "fp4"→"fp8"on line 6 of every file underbenchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/{1k1k,8k1k}/disagg/mtp/.Extended reasoning...
What the bug is
Every recipe under
benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/*/disagg/mtp/(12 files total — 6 in1k1k/, 6 in8k1k/) declaresmodel.precision: "fp4"on line 6, while at the same time settingmodel.path: "dsr1-fp8"(FP8 weights) and being placed under the directoryb300-fp8/. Their 13 sibling STP recipes in the sameb300-fp8/tree all correctly declareprecision: "fp8", and the correspondingb300-fp4/MTP recipes correctly declare"fp4". Pattern-wise, this is a uniform copy-paste error from theb300-fp4MTP templates that was missed when adapting them for FP8.How it manifests
grep -rn 'precision:' benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/shows 12/12 MTP recipes withfp4and 13/13 STP recipes withfp8. For example,b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml:Step-by-step proof
benchmark-multinode-tmpl.ymlincludes${{ env.PRECISION }}(RESULT_FILENAME=…_${PRECISION}_${FRAMEWORK}_…).PRECISIONis sourced from theprecisionfield of the recipe / master-yaml entry, per CONFIGS.md.trtllm/b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yaml, the recipe-sideprecisionreads"fp4", so the result filename and any tag forwarded tosrtctlwill sayfp4even though the container actually loadsdsr1-fp8weights.Why existing code does not prevent it
The schema validator only verifies that the recipe file path resolves on disk; it does not cross-check that
model.path(e.g.dsr1-fp8) is consistent withmodel.precision(fp4) or with the directory the recipe lives in. Nothing else in the diff (workflow templates, bench script) inspectsprecisionfor sanity — it is purely a metadata/tagging field.Impact
Result mis-classification across an entire framework-precision-mode subtree (12 recipes). Anything that consumes the
precisiontag — dashboards, sweep selection, result aggregation — will treat these b300 FP8 MTP runs as FP4. Not a hard runtime breakage, but every result generated from these recipes is mislabeled, which is more than cosmetic for a benchmarking repo whose primary output is comparable numbers across precisions.Fix
For each of the 12 affected files, change line 6 from
precision: "fp4"toprecision: "fp8". A singlesed -i 's/precision: "fp4"/precision: "fp8"/' benchmarks/multi_node/srt-slurm-recipes/dsr1/trtllm/b300-fp8/{1k1k,8k1k}/disagg/mtp/*.yamlwould do it. Files:b300-fp8/1k1k/disagg/mtp/ctx1_gen1_dp8_batch256_eplb0_mtp1_3072.yamlb300-fp8/1k1k/disagg/mtp/ctx1_gen2_dep8_batch128_eplb0_mtp1_2560.yamlb300-fp8/1k1k/disagg/mtp/ctx1_gen5_dep8_batch16_eplb0_mtp2_720.yamlb300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch16_eplb0_mtp3_160.yamlb300-fp8/1k1k/disagg/mtp/ctx1_gen8_tp8_batch1_eplb0_mtp3_10.yamlb300-fp8/1k1k/disagg/mtp/ctx3_gen2_dp8_batch512_eplb0_mtp1_11264.yamlb300-fp8/8k1k/disagg/mtp/ctx1_gen1_dp8_batch8_eplb0_mtp3_72.yamlb300-fp8/8k1k/disagg/mtp/ctx1_gen2_tp8_batch16_eplb0_mtp3_40.yamlb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch1_eplb0_mtp3_8.yamlb300-fp8/8k1k/disagg/mtp/ctx1_gen4_tp8_batch4_eplb0_mtp3_20.yamlb300-fp8/8k1k/disagg/mtp/ctx2_gen1_dp8_batch16_eplb0_mtp3_144.yamlb300-fp8/8k1k/disagg/mtp/ctx4_gen1_dp8_batch64_eplb0_mtp2_512.yaml(Comment placed on CONFIGS.md because the individual recipe files exceed the diff-comment threshold for line-anchored review comments.)