Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .github/configs/nvidia-master.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -1809,10 +1809,10 @@

glm5-nvfp4-b200-sglang:
image: lmsysorg/sglang:nightly-dev-cu13-20260328-a27651d5
model: nvidia/GLM-5-NVFP4
model-prefix: glm5
runner: b200
precision: nvfp4
precision: fp4

Check failure on line 1815 in .github/configs/nvidia-master.yaml

View check run for this annotation

Claude / Claude Code Review

Missing benchmark script glm5_fp4_b200.sh after precision rename

The precision rename from `nvfp4` to `fp4` breaks B200 benchmark execution because the corresponding shell script was not renamed. Both `runners/launch_b200-nb.sh` and `runners/launch_b200-dgxc.sh` dynamically construct the benchmark script path using the `PRECISION` env var, so they will now look for `benchmarks/single_node/glm5_fp4_b200.sh`, which does not exist — only `glm5_nvfp4_b200.sh` is present. All GLM-5 NVFP4 B200 benchmark jobs will fail at runtime with "No such file or directory"; th
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The precision rename from nvfp4 to fp4 breaks B200 benchmark execution because the corresponding shell script was not renamed. Both runners/launch_b200-nb.sh and runners/launch_b200-dgxc.sh dynamically construct the benchmark script path using the PRECISION env var, so they will now look for benchmarks/single_node/glm5_fp4_b200.sh, which does not exist — only glm5_nvfp4_b200.sh is present. All GLM-5 NVFP4 B200 benchmark jobs will fail at runtime with "No such file or directory"; the fix is to rename glm5_nvfp4_b200.sh to glm5_fp4_b200.sh (or create the new file) in the same PR.

Extended reasoning...

What the bug is and how it manifests

The PR renames the field value from to for the benchmark configuration in . However, the corresponding benchmark shell script that is expected to exist under was not renamed or created to match the new value. At runtime, the CI/CD runners will attempt to execute a script that does not exist, resulting in a hard failure for every GLM-5 NVFP4 B200 benchmark job.

The specific code path that triggers it

Both (line 20) and (line 39) construct the benchmark script path dynamically:

The environment variable is populated directly from the field in the YAML configuration (confirmed by line 83). With derived from the model prefix (e.g., ), evaluates to , so the full resolved path becomes .

Why existing code doesn't prevent it

There is no guard or existence check before the call — the script simply tries to execute the resolved path directly. The YAML config field and the benchmark shell script filenames are tightly coupled by convention but not enforced at commit or workflow-definition time. The rename of the YAML value is therefore a breaking change unless the shell script file is also renamed.

What the impact would be

Every CI run that triggers the benchmark (on both and runner types) will fail immediately at the script execution step with . No benchmark data will be collected, wasting runner time and blocking any dependent reporting or dashboards.

How to fix it

Rename Error: The following required environment variables are not set:

  • MODEL
  • TP
  • CONC
  • ISL
  • OSL
  • RANDOM_RANGE_RATIO
  • RESULT_FILENAME
  • EP_SIZE to in this same PR. Alternatively, create with equivalent content. The YAML rename alone is incomplete without the corresponding script rename.

Step-by-step proof

  1. PR changes : →
  2. CI workflow reads this YAML and sets in the runner environment ( line 83).
  3. line 20 executes:
  4. benchmarks/single_node/glm5_fp8_b200.sh
    benchmarks/single_node/glm5_nvfp4_b200.sh shows only — is absent.
  5. Shell exits with error: . Job fails.

framework: sglang

Check failure on line 1816 in .github/configs/nvidia-master.yaml

View check run for this annotation

Claude / Claude Code Review

Missing perf-changelog.yaml update (BLOCKING per repo policy)

This PR modifies `.github/configs/nvidia-master.yaml` (changing `precision: nvfp4` → `precision: fp4` for `glm5-nvfp4-b200-sglang`) but does not update `perf-changelog.yaml`. Per the repo's own policy in `.github/workflows/claude-pr-review.yml` lines 122–126, this is a 🔴 **BLOCKING** issue. A changelog entry documenting the precision key rename must be added to `perf-changelog.yaml` with a reference to PR #976 before this PR can merge.
Comment on lines 1812 to 1816
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 This PR modifies .github/configs/nvidia-master.yaml (changing precision: nvfp4precision: fp4 for glm5-nvfp4-b200-sglang) but does not update perf-changelog.yaml. Per the repo's own policy in .github/workflows/claude-pr-review.yml lines 122–126, this is a 🔴 BLOCKING issue. A changelog entry documenting the precision key rename must be added to perf-changelog.yaml with a reference to PR #976 before this PR can merge.

Extended reasoning...

What the bug is and how it manifests

The repo enforces a policy (defined in .github/workflows/claude-pr-review.yml, lines 122–126) that whenever .github/configs/nvidia-master.yaml or .github/configs/amd-master.yaml is edited in a PR, perf-changelog.yaml must also be updated in the same PR. The automated Claude PR-review bot will read the changed-file list and flag the absence as a 🔴 BLOCKING issue, leaving a comment that prevents the PR from being considered ready to merge.

The specific code path that triggers it

The diff for this PR touches exactly one file: .github/configs/nvidia-master.yaml (changing precision: nvfp4precision: fp4 for the glm5-nvfp4-b200-sglang config key at lines 1812–1816). The perf-changelog.yaml file is not in the PR's changed-file list.

Why existing code doesn't prevent it

There is no automated CI check that hard-blocks the merge via a status check. The enforcement is done by the Claude PR-review GitHub Actions bot, which posts a blocking review comment. Because it is bot-enforced rather than a required status check, the PR can technically be merged by a maintainer who overrides it — but the policy is explicit and the bot will flag it immediately on review.

Impact

The omission violates the repo's documented change-tracking policy and will cause the automated reviewer to block the PR on its first pass. This adds friction and delays the merge. More broadly, perf-changelog.yaml is the historical record used to audit what configuration parameters changed and when; skipping an entry means this rename (from the incorrect nvfp4 key to the correct fp4 key) will be untracked.

How to fix it

Add an entry to perf-changelog.yaml following the established format seen in other entries (PR #919, #857, etc.):

- config-keys:
    - glm5-nvfp4-b200-sglang
  description:
    - "Fix precision key name from 'nvfp4' to 'fp4' for GLM-5 NVFP4 B200 SGLang config"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/976

Step-by-step proof

  1. .github/workflows/claude-pr-review.yml lines 122–126 state: "If either master config file was edited AND perf-changelog.yaml was NOT edited in the same PR: This is a 🔴 BLOCKING issue."
  2. The PR's changed-files list contains only .github/configs/nvidia-master.yaml.
  3. perf-changelog.yaml is absent from the changed-files list.
  4. Condition (master config edited) = true; Condition (perf-changelog edited) = false.
  5. Therefore the policy is triggered and the bot will block the PR.

multinode: false
seq-len-configs:

Check warning on line 1818 in .github/configs/nvidia-master.yaml

View check run for this annotation

Claude / Claude Code Review

Config key name inconsistent with precision value after fix

The config key `glm5-nvfp4-b200-sglang` still embeds `nvfp4` while `precision: fp4` after this fix, breaking the established `{model-prefix}-{precision}-{hardware}-{framework}` naming pattern. Consider renaming the key to `glm5-fp4-b200-sglang` and updating the corresponding reference in `perf-changelog.yaml`.
Comment on lines 1812 to 1818
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The config key glm5-nvfp4-b200-sglang still embeds nvfp4 while precision: fp4 after this fix, breaking the established {model-prefix}-{precision}-{hardware}-{framework} naming pattern. Consider renaming the key to glm5-fp4-b200-sglang and updating the corresponding reference in perf-changelog.yaml.

Extended reasoning...

What the bug is and how it manifests

The PR correctly changes precision: nvfp4 to precision: fp4, but leaves the YAML key itself as glm5-nvfp4-b200-sglang. After the fix the key name and the precision field are inconsistent: the key says nvfp4 while the precision field says fp4.

The specific code path that triggers it

All other NVFP4-quantized model configs in nvidia-master.yaml use fp4 in their key names, not nvfp4:

  • kimik2.5-fp4-b200-vllm — model: nvidia/Kimi-K2.5-NVFP4, precision: fp4
  • dsr1-fp4-gb200-dynamo-trt — model: nvidia/DeepSeek-R1-0528-NVFP4-v2, precision: fp4

glm5-nvfp4-b200-sglang is the only entry that bakes the vendor tag nvfp4 into the key while the canonical precision field says fp4.

Additionally, perf-changelog.yaml (line 1210) references the old key name and would be a dangling reference if the key is ever renamed downstream.

Why existing code doesn't prevent it

The precision field is authoritative for benchmark script discovery (per CONFIGS.md line 33), so full-sweep workflows continue to work. The convention mismatch is therefore not caught by any hard failure — it's a silent inconsistency.

Impact

Wildcard key filtering such as test-config --config-keys '*-fp4-*' will miss this entry, reducing discoverability. Developers searching for all fp4 configs by key pattern will silently omit this one.

Addressing the refutation

A refutation notes that CONFIGS.md line 25 says the naming convention is "not required" and that no existing tooling mechanically parses key names. Both points are valid: this is not a functional breakage, which is why the severity is nit. However, every other NVFP4 model entry already follows the fp4 convention in the key name, making this the sole outlier after the precision field was corrected. Consistency in human-readable identifiers still has value, especially when perf-changelog.yaml has a cross-reference that would need to stay in sync regardless.

How to fix

Rename the key from glm5-nvfp4-b200-sglang to glm5-fp4-b200-sglang in .github/configs/nvidia-master.yaml, and update the matching reference in perf-changelog.yaml to keep cross-references consistent.

Step-by-step proof

  1. After this PR: key = glm5-nvfp4-b200-sglang, precision field = fp4.
  2. A developer runs test-config --config-keys '*-fp4-b200-sglang' to find all fp4 B200 SGLang configs.
  3. The glob *-fp4-b200-sglang does not match glm5-nvfp4-b200-sglang, so this entry is silently skipped.
  4. Compare: kimik2.5-fp4-b200-vllm correctly matches *-fp4-b200-vllm because its key was set to fp4 from the start.

Expand Down