Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 9 additions & 0 deletions perf-changelog.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -1794,3 +1794,12 @@
- "DSv4-Pro on MI355X exceeded the 3h cap (STEP CANCELLED DUE TO TIME LIMIT) due to ~30min MoE JIT compile plus slow torch-fallback kernels (SGLANG_HACK_FLASHMLA_BACKEND=torch et al.) from sgl-project/sglang#23608"
- "300 minutes matches the GH Actions outer timeout-minutes cap in benchmark-tmpl.yml"
- "Retriggering dsv4-fp8-mi355x-sglang"

- config-keys:
- dsv4-fp8-mi355x-sglang
description:
- "Bump MI355X SLURM allocation from --time=180 to --time=300 in runners/launch_mi355x-amds.sh"
- "DSv4-Pro on MI355X exceeded the 3h cap (STEP CANCELLED DUE TO TIME LIMIT) due to ~30min MoE JIT compile plus slow torch-fallback kernels (SGLANG_HACK_FLASHMLA_BACKEND=torch et al.) from sgl-project/sglang#23608"
- "300 minutes matches the GH Actions outer timeout-minutes cap in benchmark-tmpl.yml"
- "Retriggering dsv4-fp8-mi355x-sglang"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1148

Check warning on line 1805 in perf-changelog.yaml

View check run for this annotation

Claude / Claude Code Review

Duplicate changelog entry for dsv4-fp8-mi355x-sglang

This new entry duplicates the block added by PR #1147 (immediately above at lines 1790-1796) — all four description bullets are verbatim copies. Since PR #1148's diff only modifies perf-changelog.yaml and does not touch `runners/launch_mi355x-amds.sh`, the bullet "Bump MI355X SLURM allocation from --time=180 to --time=300" misrepresents what this PR actually does. Consider replacing the four bullets with a single `Retriggering dsv4-fp8-mi355x-sglang` bullet (matching the convention at line 1727
Comment on lines +1798 to +1805
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 This new entry duplicates the block added by PR #1147 (immediately above at lines 1790-1796) — all four description bullets are verbatim copies. Since PR #1148's diff only modifies perf-changelog.yaml and does not touch runners/launch_mi355x-amds.sh, the bullet "Bump MI355X SLURM allocation from --time=180 to --time=300" misrepresents what this PR actually does. Consider replacing the four bullets with a single Retriggering dsv4-fp8-mi355x-sglang bullet (matching the convention at line 1727 for PR #1120), and/or backfilling pr-link: #1147 on the existing block instead.

Extended reasoning...

What the bug is. The new entry appended at lines 1798-1805 reuses, word-for-word, the four description bullets from the entry already in the file at lines 1790-1796. Both blocks have the same config-keys: [dsv4-fp8-mi355x-sglang] and the same description. The only differences are that the new block adds pr-link: #1148 and the existing block has no pr-link at all.

Why this is misleading. The first bullet — "Bump MI355X SLURM allocation from --time=180 to --time=300 in runners/launch_mi355x-amds.sh" — describes a concrete code change to a runner script. But git log shows that change was already made by commit d40be91 (PR #1147, titled "Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang"), and the PR metadata for #1148 reports changed-files count=1 with only perf-changelog.yaml touched. So as written, the new entry claims credit for a script change this PR does not contain.

Step-by-step proof.

  1. git log --oneline includes d40be91 Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang (#1147) — confirming PR Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang #1147 already landed the script bump.
  2. The PR diff for Increase timeout #1148 contains exactly one hunk, all in perf-changelog.yaml; nothing in runners/ is modified.
  3. Reading lines 1790-1796 vs 1798-1805 of the post-merge file shows the four description: bullets are byte-identical between the two blocks.
  4. Therefore after merging, the changelog will contain two consecutive entries for the same config-key describing the same change — one (from Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang #1147) without a pr-link, and one (from Increase timeout #1148) with pr-link: #1148 that falsely attributes the SLURM bump to Increase timeout #1148.

Why existing convention doesn't catch this. The file has a precedent for retrigger-only entries: line 1727's PR #1120 entry uses a single bullet "Trigger H100 multinode evals after NVSHEMM fixes" rather than copying the underlying-fix entry's bullets. Following that convention here would have avoided the duplication entirely.

Impact. No runtime impact — perf-changelog.yaml is documentation. But the changelog is the canonical record of what each PR changed, and a duplicate block plus a mis-attributed bullet clutters that record and will confuse anyone trying to reconstruct when the SLURM time-limit was actually bumped.

Suggested fix. Either (a) replace the four-bullet description in this PR with a single Retriggering dsv4-fp8-mi355x-sglang bullet, mirroring the #1120 convention; or (b) drop the new block entirely from this PR and instead add pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1147 to the existing #1147 block at lines 1790-1796 (which is currently the only pr-link-less entry near the bottom of the file).

Loading