Skip to content

dsv4-mi355x-sgl#1244

Open
Oseltamivir wants to merge 5 commits intomainfrom
dsv4-mi355x-sgl
Open

dsv4-mi355x-sgl#1244
Oseltamivir wants to merge 5 commits intomainfrom
dsv4-mi355x-sgl

Conversation

@Oseltamivir
Copy link
Copy Markdown
Collaborator

No description provided.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 1, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@Oseltamivir Oseltamivir marked this pull request as ready for review May 1, 2026 00:39
@Oseltamivir Oseltamivir requested a review from a team May 1, 2026 00:39
@Oseltamivir Oseltamivir requested a review from a team May 1, 2026 00:39
Comment on lines +30 to +41
(
cd "$SGL_PR_DIR"
git fetch --depth=1 origin "$SGL_PR_SHA" 2>/dev/null \
|| git fetch --depth=1 origin amd/deepseek_v4
git checkout --force "$SGL_PR_SHA"
test "$(git rev-parse HEAD)" = "$SGL_PR_SHA"

# Reinstall just the Python package; the image already has the ROCm
# kernel deps (aiter, triton, tilelang, torch) at versions matched to
# this branch, so --no-deps avoids pip resolving them against PyPI.
pip install --no-build-isolation --no-deps --force-reinstall -e python/
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new SGLang pin block at dsv4_fp8_mi355x.sh:30-41 ends with test "$(git rev-parse HEAD)" = "$SGL_PR_SHA" as a pin assertion, but this script (and the sourced benchmark_lib.sh) never enable set -e / set -o pipefail, so a failed assertion (or a failed git checkout --force) is silently ignored — the subshell's exit status is the trailing pip install. The two sibling scripts in this same DSv4 family (dsv4_fp8_mi355x_vllm.sh and dsv4_fp4_mi355x_atom.sh) both start with set -eo pipefail, so this looks like an oversight; when amd/deepseek_v4 advances past SGL_PR_SHA (the in-script comment explicitly anticipates this) and fetch-by-SHA is rejected, the fallback only fetches the new branch tip, the SHA checkout fails, the assertion fires-but-is-ignored, and pip installs the wrong commit while perf-changelog continues to attribute results to SHA 18afbf15. Fix by adding set -eo pipefail at the top, or at minimum || exit 1 on the assertion (and on the checkout).

Extended reasoning...

What the bug is

The PR adds a runtime SGLang overlay block in benchmarks/single_node/dsv4_fp8_mi355x.sh:

SGL_PR_SHA="18afbf151a2992b06a089191769b299629ed73dd"
...
(
    cd "$SGL_PR_DIR"
    git fetch --depth=1 origin "$SGL_PR_SHA" 2>/dev/null \
        || git fetch --depth=1 origin amd/deepseek_v4
    git checkout --force "$SGL_PR_SHA"
    test "$(git rev-parse HEAD)" = "$SGL_PR_SHA"
    pip install --no-build-isolation --no-deps --force-reinstall -e python/
)

The test line is clearly meant as a hard pin assertion. But the script's shebang is just #!/usr/bin/env bash with no set -e / set -o pipefail, and the sourced benchmarks/benchmark_lib.sh only sets set -x inside two helper functions — never error-on-failure mode. So with default bash semantics, the subshell's exit status is the exit status of its last command (pip install), and an earlier failed test (or git checkout) does not abort the subshell.

Why it's clearly an oversight

The two sibling DSv4 scripts in the same directory use the identical pinning pattern but explicitly start with set -eo pipefail:

  • benchmarks/single_node/dsv4_fp8_mi355x_vllm.sh:2set -eo pipefail
  • benchmarks/single_node/dsv4_fp4_mi355x_atom.sh:2set -eo pipefail

The author of those sister scripts knew the test-line is meant as a hard assertion and that strict mode is required for it to act as one. The new dsv4_fp8_mi355x.sh copied the pin pattern but dropped the strict-mode preamble.

Step-by-step proof of the failure path

The in-script comment says Bump SGL_PR_SHA when the branch advances — i.e. the author explicitly anticipates that amd/deepseek_v4 will advance past SGL_PR_SHA between bumps. That is exactly the case where the silent failure manifests:

  1. git fetch --depth=1 origin "$SGL_PR_SHA" 2>/dev/null fails. github.com supports fetch-by-SHA via uploadpack.allowReachableSHA1InWant/allowAnySHA1InWant for many but not all repos, and shallow-fetch-by-SHA can be rejected even when the SHA is reachable. (The script's own author wrote a fallback for exactly this case.)
  2. The || fallback runs git fetch --depth=1 origin amd/deepseek_v4, which only fetches the current tip of that branch. If the branch advanced past SGL_PR_SHA (the anticipated case), the pinned SHA is not in the local repo.
  3. git checkout --force "$SGL_PR_SHA" fails (error: pathspec '18afbf15…' did not match any file(s) known to git), leaving HEAD at whatever it was previously checked out — typically the previous run's branch tip.
  4. test "$(git rev-parse HEAD)" = "$SGL_PR_SHA" returns 1 — but with no set -e, execution continues into the next command.
  5. pip install --no-build-isolation --no-deps --force-reinstall -e python/ succeeds (it's just an editable install of whatever the current checkout is), so the subshell exits 0 and the parent script keeps going.
  6. The post-overlay diagnostic line python3 -c "import sglang; print(f'sglang {sglang.__version__} from …')" only prints __version__ — it does not validate against SGL_PR_SHA. __version__ on the moved branch tip will look reasonable.
  7. The benchmark proceeds and produces numbers attributed in perf-changelog.yaml to Reinstall SGLang from amd/deepseek_v4 SHA 18afbf15 at runtime — but the numbers were generated by a different commit. This directly defeats the in-script comment's stated goal: we want a reproducible pin per benchmark run.

Impact

  • perf-changelog.yaml mis-attributes results to SHA 18afbf15 while running against whatever the branch tip happens to be at run time.
  • This silently breaks reproducibility precisely in the scenario the comment was written for (Bump SGL_PR_SHA when the branch advances).
  • The diagnostic __version__ print does not catch the misattribution.

Fix

One-line fix — match the sister scripts:

#!/usr/bin/env bash
set -eo pipefail

Or, at minimum, fail loudly on the assertion and the checkout:

git checkout --force "$SGL_PR_SHA" || exit 1
test "$(git rev-parse HEAD)" = "$SGL_PR_SHA" || exit 1

…x-sgl

# Conflicts:
#	.github/configs/amd-master.yaml
#	perf-changelog.yaml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant