dsv4-mi355x-sgl#1244
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
095be90 to
145c969
Compare
145c969 to
c731702
Compare
| ( | ||
| cd "$SGL_PR_DIR" | ||
| git fetch --depth=1 origin "$SGL_PR_SHA" 2>/dev/null \ | ||
| || git fetch --depth=1 origin amd/deepseek_v4 | ||
| git checkout --force "$SGL_PR_SHA" | ||
| test "$(git rev-parse HEAD)" = "$SGL_PR_SHA" | ||
|
|
||
| # Reinstall just the Python package; the image already has the ROCm | ||
| # kernel deps (aiter, triton, tilelang, torch) at versions matched to | ||
| # this branch, so --no-deps avoids pip resolving them against PyPI. | ||
| pip install --no-build-isolation --no-deps --force-reinstall -e python/ | ||
| ) |
There was a problem hiding this comment.
🔴 The new SGLang pin block at dsv4_fp8_mi355x.sh:30-41 ends with test "$(git rev-parse HEAD)" = "$SGL_PR_SHA" as a pin assertion, but this script (and the sourced benchmark_lib.sh) never enable set -e / set -o pipefail, so a failed assertion (or a failed git checkout --force) is silently ignored — the subshell's exit status is the trailing pip install. The two sibling scripts in this same DSv4 family (dsv4_fp8_mi355x_vllm.sh and dsv4_fp4_mi355x_atom.sh) both start with set -eo pipefail, so this looks like an oversight; when amd/deepseek_v4 advances past SGL_PR_SHA (the in-script comment explicitly anticipates this) and fetch-by-SHA is rejected, the fallback only fetches the new branch tip, the SHA checkout fails, the assertion fires-but-is-ignored, and pip installs the wrong commit while perf-changelog continues to attribute results to SHA 18afbf15. Fix by adding set -eo pipefail at the top, or at minimum || exit 1 on the assertion (and on the checkout).
Extended reasoning...
What the bug is
The PR adds a runtime SGLang overlay block in benchmarks/single_node/dsv4_fp8_mi355x.sh:
SGL_PR_SHA="18afbf151a2992b06a089191769b299629ed73dd"
...
(
cd "$SGL_PR_DIR"
git fetch --depth=1 origin "$SGL_PR_SHA" 2>/dev/null \
|| git fetch --depth=1 origin amd/deepseek_v4
git checkout --force "$SGL_PR_SHA"
test "$(git rev-parse HEAD)" = "$SGL_PR_SHA"
pip install --no-build-isolation --no-deps --force-reinstall -e python/
)The test line is clearly meant as a hard pin assertion. But the script's shebang is just #!/usr/bin/env bash with no set -e / set -o pipefail, and the sourced benchmarks/benchmark_lib.sh only sets set -x inside two helper functions — never error-on-failure mode. So with default bash semantics, the subshell's exit status is the exit status of its last command (pip install), and an earlier failed test (or git checkout) does not abort the subshell.
Why it's clearly an oversight
The two sibling DSv4 scripts in the same directory use the identical pinning pattern but explicitly start with set -eo pipefail:
benchmarks/single_node/dsv4_fp8_mi355x_vllm.sh:2→set -eo pipefailbenchmarks/single_node/dsv4_fp4_mi355x_atom.sh:2→set -eo pipefail
The author of those sister scripts knew the test-line is meant as a hard assertion and that strict mode is required for it to act as one. The new dsv4_fp8_mi355x.sh copied the pin pattern but dropped the strict-mode preamble.
Step-by-step proof of the failure path
The in-script comment says Bump SGL_PR_SHA when the branch advances — i.e. the author explicitly anticipates that amd/deepseek_v4 will advance past SGL_PR_SHA between bumps. That is exactly the case where the silent failure manifests:
git fetch --depth=1 origin "$SGL_PR_SHA" 2>/dev/nullfails. github.com supports fetch-by-SHA viauploadpack.allowReachableSHA1InWant/allowAnySHA1InWantfor many but not all repos, and shallow-fetch-by-SHA can be rejected even when the SHA is reachable. (The script's own author wrote a fallback for exactly this case.)- The
||fallback runsgit fetch --depth=1 origin amd/deepseek_v4, which only fetches the current tip of that branch. If the branch advanced pastSGL_PR_SHA(the anticipated case), the pinned SHA is not in the local repo. git checkout --force "$SGL_PR_SHA"fails (error: pathspec '18afbf15…' did not match any file(s) known to git), leaving HEAD at whatever it was previously checked out — typically the previous run's branch tip.test "$(git rev-parse HEAD)" = "$SGL_PR_SHA"returns1— but with noset -e, execution continues into the next command.pip install --no-build-isolation --no-deps --force-reinstall -e python/succeeds (it's just an editable install of whatever the current checkout is), so the subshell exits 0 and the parent script keeps going.- The post-overlay diagnostic line
python3 -c "import sglang; print(f'sglang {sglang.__version__} from …')"only prints__version__— it does not validate againstSGL_PR_SHA.__version__on the moved branch tip will look reasonable. - The benchmark proceeds and produces numbers attributed in
perf-changelog.yamltoReinstall SGLang from amd/deepseek_v4 SHA 18afbf15 at runtime— but the numbers were generated by a different commit. This directly defeats the in-script comment's stated goal:we want a reproducible pin per benchmark run.
Impact
perf-changelog.yamlmis-attributes results to SHA18afbf15while running against whatever the branch tip happens to be at run time.- This silently breaks reproducibility precisely in the scenario the comment was written for (
Bump SGL_PR_SHA when the branch advances). - The diagnostic
__version__print does not catch the misattribution.
Fix
One-line fix — match the sister scripts:
#!/usr/bin/env bash
set -eo pipefailOr, at minimum, fail loudly on the assertion and the checkout:
git checkout --force "$SGL_PR_SHA" || exit 1
test "$(git rev-parse HEAD)" = "$SGL_PR_SHA" || exit 1…x-sgl # Conflicts: # .github/configs/amd-master.yaml # perf-changelog.yaml
00a25ec to
e18fc23
Compare
No description provided.