Merged
20 changes: 10 additions & 10 deletions .github/configs/runners.yaml
@@ -51,19 +51,19 @@
  - 'h200-dgxc-slurm_13'
  b200:
  - 'b200-cw_00'
  - 'b200-cw_01'
  - 'b200-nb_0'
  - 'b200-nb_1'
- - 'b200-dgxc_0'
- - 'b200-dgxc_1'
- - 'b200-dgxc_2'
- - 'b200-dgxc_3'
- - 'b200-dgxc_4'
- - 'b200-dgxc_5'
- - 'b200-dgxc_6'
- - 'b200-dgxc_7'
- - 'b200-dgxc_8'
- - 'b200-dgxc_9'
+ - 'b200-dgxc_00'
+ - 'b200-dgxc_01'
+ - 'b200-dgxc_02'
+ - 'b200-dgxc_03'
+ - 'b200-dgxc_04'
+ - 'b200-dgxc_05'
+ - 'b200-dgxc_06'

Check failure on line 63 in .github/configs/runners.yaml

Claude / Claude Code Review

Multi-node b200 launch script renamed but b200-multinode runners still reference old path

Comment on lines 54 to +63
🔴 The b200-multinode runner labels at runners.yaml:74-77 still use the b200-dgxc-slurm_* prefix, but this PR renamed runners/launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh. Since the workflow templates dispatch via bash ./runners/launch_${RUNNER_NAME%%_*}.sh, all six runner: b200-multinode jobs in nvidia-master.yaml will fail at the launch step with 'No such file or directory'. Fix by renaming the b200-multinode entries to use the b200-dgxc prefix (and re-registering the self-hosted runners), or by adding a launch_b200-dgxc-slurm.sh symlink/wrapper.

Extended reasoning...

What the bug is

Commit f5ffe76 performs a 100% rename of runners/launch_b200-dgxc-slurm.sh → runners/launch_b200-dgxc.sh (visible as R100 in git show f5ffe76 --name-status). The single-node b200 group in .github/configs/runners.yaml is also relabeled from b200-dgxc_0..9 to zero-padded b200-dgxc_00..09. However, the b200-multinode group at runners.yaml:74-77 is left untouched and still reads:

b200-multinode:
- 'b200-dgxc-slurm_6'
- 'b200-dgxc-slurm_7'
- 'b200-dgxc-slurm_8'

The code path that triggers it

All three workflow dispatchers compute the launch script from the runner name with bash parameter expansion that strips everything from the first underscore:

  • .github/workflows/benchmark-multinode-tmpl.yml:177: bash ./runners/launch_${RUNNER_NAME%%_*}.sh
  • .github/workflows/benchmark-tmpl.yml:154: same
  • .github/workflows/profile.yml:167: same

RUNNER_NAME is set from ${{ runner.name }}, i.e. the literal label from runners.yaml.
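
The expansion described above can be tried in isolation. The following is a minimal sketch, not the workflow's own code: the helper name launch_script_for is hypothetical, and the three labels are simply ones that appear in this PR.

```shell
#!/usr/bin/env bash
# Map a runner name to its launch script the way the workflow templates do.
# "%%_*" removes the longest suffix matching "_*", i.e. everything from the
# first underscore onward. Hyphens are left alone.
# NOTE: launch_script_for is a hypothetical helper used only for illustration.
launch_script_for() {
  local RUNNER_NAME="$1"
  echo "./runners/launch_${RUNNER_NAME%%_*}.sh"
}

# Labels taken from runners.yaml as shown in this PR.
for name in b200-dgxc_00 b200-dgxc-slurm_6 h200-dgxc-slurm_13; do
  printf '%s -> %s\n' "$name" "$(launch_script_for "$name")"
done
```

The b200-dgxc-slurm_6 line is the one to look at: it resolves to launch_b200-dgxc-slurm.sh, not to the renamed launch_b200-dgxc.sh.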

Step-by-step proof

  1. A runs-on: b200-multinode job lands on the runner named b200-dgxc-slurm_6.
  2. The workflow sets RUNNER_NAME=b200-dgxc-slurm_6.
  3. ${RUNNER_NAME%%_*} strips from the first _ to the end, yielding b200-dgxc-slurm (the hyphens are not delimiters).
  4. The workflow runs bash ./runners/launch_b200-dgxc-slurm.sh.
  5. That file no longer exists in the tree (ls runners/ shows only launch_b200-dgxc.sh; grep launch_b200-dgxc-slurm returns no hits).
  6. Bash exits with No such file or directory, and the workflow step fails.
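
The steps above can be reproduced locally without a runner. This is a rough sketch under stated assumptions: it builds a scratch directory containing only a stand-in for the renamed script (the real script's contents are not reproduced here), and dispatch is a hypothetical helper that mimics the workflow step.

```shell
#!/usr/bin/env bash
# Recreate the post-PR layout in a scratch directory: only the renamed
# launch script exists. The script body is a stand-in, not the real one.
workdir="$(mktemp -d)"
mkdir -p "$workdir/runners"
printf '#!/usr/bin/env bash\necho launched\n' > "$workdir/runners/launch_b200-dgxc.sh"

# Dispatch the way the workflow step does (hypothetical helper).
dispatch() {
  local RUNNER_NAME="$1"
  (cd "$workdir" && bash "./runners/launch_${RUNNER_NAME%%_*}.sh")
}

# Single-node label: the script is found and runs.
dispatch b200-dgxc_00

# Multinode label: bash cannot open the script and returns non-zero.
dispatch b200-dgxc-slurm_6 || echo "multinode dispatch failed with status $?"
```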

Why existing code does not prevent it

Nothing else in the dispatch chain re-maps the prefix; the workflow uses the raw runner-name prefix to pick the script. The renamed launch_b200-dgxc.sh even contains the IS_MULTINODE=true branch, showing the intent was to cover the multinode case — but the runner labels were not updated to match.
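
One way to catch this class of mismatch before merge is a small consistency check. The sketch below is an assumption, not something present in the repository: check_launch_scripts is a hypothetical helper, it assumes labels appear one per line in the form - 'label_NN' as in the hunk above, and it assumes every label is expected to have a matching launch script.

```shell
#!/usr/bin/env bash
# Print every runner label whose derived launch script is missing.
# Returns non-zero if at least one script is missing.
# Assumption: labels are written one per line as:  - 'label_NN'
check_launch_scripts() {
  local config="$1" runners_dir="$2" missing=0 label script
  while IFS= read -r label; do
    [ -n "$label" ] || continue
    script="$runners_dir/launch_${label%%_*}.sh"
    if [ ! -e "$script" ]; then
      echo "missing: $script (needed by $label)"
      missing=1
    fi
  done <<EOF
$(sed -n "s/^[[:space:]]*-[[:space:]]*'\([^']*\)'.*/\1/p" "$config")
EOF
  return "$missing"
}

# Possible invocation from the repository root:
#   check_launch_scripts .github/configs/runners.yaml runners
```

The labels are fed through a here-document and not a pipe so that the missing flag is updated in the current shell and the function's return status reflects it.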

Impact

.github/configs/nvidia-master.yaml declares runner: b200-multinode in six places (lines 5, 390, 6623, 6759, 6929, 7128), covering DSR1-FP4 / DSR1-FP8 dynamo-trt and dynamo-sglang multinode benchmarks. Every one of these will fail at the launch step immediately after this PR merges. The h100/h200 stacks are unaffected because their multinode labels (h100-dgxc-slurm_*, h200-dgxc-slurm_*) still match their existing launch_h100-dgxc-slurm.sh / launch_h200-dgxc-slurm.sh scripts — only b200 has the prefix mismatch after this PR.

Fix

Two straightforward options:

  1. Rename the b200-multinode entries to share the new prefix, e.g. b200-dgxc_06/07/08 (and re-register the self-hosted runners under the new labels) so ${RUNNER_NAME%%_*} resolves to b200-dgxc and dispatches to launch_b200-dgxc.sh.
  2. Restore the old name as a wrapper or symlink (runners/launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh) so the existing labels keep resolving.
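
Option 2 could look roughly like the following. This is a sketch only: add_compat_symlink is a hypothetical helper, and it assumes a relative symlink is acceptable in the repository and on the runners' filesystems.

```shell
#!/usr/bin/env bash
# Keep the old script name resolvable via a relative symlink placed next to
# the renamed script (hypothetical helper, for illustration only).
add_compat_symlink() {
  local runners_dir="$1"
  # The link target is resolved relative to the link's own directory.
  ln -s launch_b200-dgxc.sh "$runners_dir/launch_b200-dgxc-slurm.sh"
}

# Possible invocation from the repository root:
#   add_compat_symlink runners
```

If the launch script inspects its own name or path (for example through $0), it will see the old name when invoked through the link, so a thin wrapper script may be the safer variant in that case.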

+ - 'b200-dgxc_07'
+ - 'b200-dgxc_08'
+ - 'b200-dgxc_09'
  - 'b200-dgxc_10'
  - 'b200-dgxc_11'
  - 'b200-dgxc_12'
File renamed without changes.