SGL DSV4 MI355x #1231
Open

Oseltamivir wants to merge 4 commits into main from sgl-dsv4-mi355x
+225 −4
Commits (4):

- `1c6d5b1` SGL DSV4 MI355x (Oseltamivir)
- `6ebe53d` Bump dsv4-fp8-mi355x-sglang image, add FP4 sglang variant (Oseltamivir)
- `c5a49ee` Align with branch run_dsv4.sh after FP4 Models commit; switch FP8 model (Oseltamivir)
- `4a573fd` Roll back to rocm700 image to dodge symmetric-memory crash (Oseltamivir)
New file (153 lines):

```bash
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

hf download "$MODEL"

# Overlay sglang from the amd/deepseek_v4 branch on top of whatever the
# rocm/sgl-dev:v0.5.10.post1-rocm700-mi35x-20260428 image ships with. We
# stay on the rocm700 (ROCm 7.0.0a) line because rocm720 hit
# hipErrorInvalidConfiguration on use_symmetric_memory-allocated dp_attention
# buffers (RCCL symmetric-memory bug; SGLANG_USE_ROCM700A WA only covers the
# cuda-graph path, not eager mode that we use via --disable-cuda-graph).
# Bump SGL_PR_SHA when the branch advances.
SGL_PR_SHA="18afbf151a2992b06a089191769b299629ed73dd"
SGL_PR_DIR="/tmp/sglang-amd-dsv4"

if [ ! -d "$SGL_PR_DIR/.git" ]; then
    git clone --filter=blob:none https://github.com/sgl-project/sglang.git "$SGL_PR_DIR"
fi
(
    cd "$SGL_PR_DIR"
    git fetch --depth=1 origin "$SGL_PR_SHA" 2>/dev/null \
        || git fetch --depth=1 origin amd/deepseek_v4
    git checkout --force "$SGL_PR_SHA"
    test "$(git rev-parse HEAD)" = "$SGL_PR_SHA"

    # Reinstall just the Python package; the image already has the ROCm
    # kernel deps (aiter, triton, tilelang, torch) at versions matched to
    # this branch, so --no-deps avoids pip resolving them against PyPI.
    pip install --no-build-isolation --no-deps --force-reinstall -e python/
)

python3 -c "import sglang; print(f'sglang {sglang.__version__} from {sglang.__path__[0]}')"

# Transformers in the container doesn't recognize the `deepseek_v4` model_type.
# PR #23608's fallback in hf_transformers_utils.get_config tries to handle this
# by writing a patched config to /tmp, but in practice isn't catching the error
# in this image. Patch the cached config.json directly instead: set model_type
# to `deepseek_v3` so AutoConfig.from_pretrained succeeds, and keep
# architectures=['DeepseekV4ForCausalLM'] so SGLang dispatches to its native
# DSv4 model class (python/sglang/srt/models/deepseek_v4.py).
python3 << PYEOF
import json
from huggingface_hub import hf_hub_download
path = hf_hub_download(repo_id="$MODEL", filename="config.json")
with open(path) as f:
    config = json.load(f)
if config.get("model_type") == "deepseek_v4":
    config["model_type"] = "deepseek_v3"
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    print(f"Patched {path}: model_type deepseek_v4 -> deepseek_v3")
else:
    print(f"No patch needed: model_type is {config.get('model_type')!r}")
PYEOF

# DSv4 FP4-experts path. Mirrors the active path of python/run_dsv4.sh on
# the amd/deepseek_v4 branch at SGL_PR_SHA:
#   SGLANG_DSV4_FP4_EXPERTS=True  -> route experts through the FP4 kernels
#   SGLANG_FORCE_TRITON_MOE_FP8=0 -> dispatch MoE through aiter (gating
#                                    switch added in commit 33de1e64);
#                                    also enables swiglu_limit clamp in the
#                                    triton MoE fallback path.
export SGLANG_REASONING_EFFORT=max
export SGLANG_OPT_USE_FUSED_COMPRESS=false
export SGLANG_OPT_USE_OLD_COMPRESSOR=true
export SGLANG_OPT_USE_TILELANG_SWA_PREPARE=false
export SGLANG_OPT_USE_JIT_KERNEL_FUSED_TOPK=false
export SGLANG_OPT_USE_FUSED_HASH_TOPK=false
export SGLANG_HACK_FLASHMLA_BACKEND=torch
export SGLANG_OPT_DEEPGEMM_HC_PRENORM=false
export SGLANG_OPT_USE_TILELANG_MHC_PRE=false
export SGLANG_OPT_USE_TILELANG_MHC_POST=false
export SGLANG_ENABLE_THINKING=1
export SGLANG_USE_AITER=1
export SGLANG_USE_ROCM700A=1
export SGLANG_TOPK_TRANSFORM_512_TORCH=1
export SGLANG_FP8_PAGED_MQA_LOGITS_TORCH=1
export SGLANG_DSV4_FP4_EXPERTS=True
export SGLANG_OPT_DPSK_V4_RADIX=0
export SGLANG_OPT_USE_OVERLAP_STORE_CACHE=false
export SGLANG_OPT_USE_FUSED_STORE_CACHE=false
export SGLANG_FORCE_TRITON_MOE_FP8=0

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

python3 -m sglang.launch_server \
    --model-path $MODEL \
    --host=0.0.0.0 \
    --port $PORT \
    --tensor-parallel-size $TP \
    --dp $TP \
    --enable-dp-attention \
    --trust-remote-code \
    --disable-radix-cache \
    --attention-backend compressed \
    --max-running-request 256 \
    --page-size 256 \
    --chunked-prefill-size 8192 \
    --disable-shared-experts-fusion \
    --disable-cuda-graph \
    --tool-call-parser deepseekv4 \
    --reasoning-parser deepseek-v4 \
    --watchdog-timeout 1800 $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
```
🔴 The new sglang overlay block (lines 20-41) does not actually enforce the "reproducible pin per benchmark run" the comment promises: the fallback `git fetch --depth=1 origin amd/deepseek_v4` only retrieves the branch tip, so when `SGL_PR_SHA` lags the branch (the exact case the "Bump SGL_PR_SHA when the branch advances" comment anticipates) and the by-SHA fetch fails, the subsequent `git checkout --force "$SGL_PR_SHA"` cannot resolve the SHA. Combined with the missing `set -eo pipefail` (the sister script `dsv4_fp8_mi355x_vllm.sh:2` has it), the failed checkout, the `test "$(git rev-parse HEAD)" = "$SGL_PR_SHA"` pin guard at line 35, and any other failure are all swallowed: `pip install -e python/` runs unconditionally against whatever was previously checked out, and benchmark numbers get reported against the wrong sglang. Fix: add `set -eo pipefail` near the top, and either drop the `--depth=1` fallback or fetch `amd/deepseek_v4` with deeper history (e.g. `--depth=50`) so older pinned SHAs remain reachable.

Extended reasoning:
Bug
The overlay block added at lines 20-41 promises a "reproducible pin per benchmark run" (see comment lines 20-23), but the control flow does not enforce it. Two compounding issues silently install the wrong sglang on a fetch/pin mismatch:
1. Shallow-fetch fallback cannot resolve a non-tip pin (lines 32-33). The clone at line 26 uses `--filter=blob:none` and clones only the default branch, which is not `amd/deepseek_v4`. The subshell then runs `git fetch --depth=1 origin "$SGL_PR_SHA" 2>/dev/null || git fetch --depth=1 origin amd/deepseek_v4`. The fallback retrieves a single commit, the current tip of `amd/deepseek_v4`. The very next line, `git checkout --force "$SGL_PR_SHA"`, then fails with `reference is not a tree` whenever `SGL_PR_SHA` is not that tip. The script's own comment "Bump SGL_PR_SHA when the branch advances" explicitly anticipates that the pinned SHA lags the branch tip, so this is not a hypothetical case. The fetch-by-SHA can fail for ordinary reasons (transient GitHub error, unauthenticated rate limit, the SHA being GC'd after a force-push), and the fallback then cannot produce the requested object.
set -eo pipefail(line 1 is the shebang, line 3 sourcesbenchmark_lib.shwith no shell-options line in between). The sister scriptbenchmarks/single_node/dsv4_fp8_mi355x_vllm.sh:2setsset -eo pipefail, and the sourcedbenchmark_lib.shdoes not enable errexit either. Without it, every failure inside the subshell on lines 30-41 is swallowed:git checkout --force "$SGL_PR_SHA"does not abort the subshell;test "$(git rev-parse HEAD)" = "$SGL_PR_SHA"at line 35 — the explicit pin guard — only sets$?, it does not exit;pip install --no-build-isolation --no-deps --force-reinstall -e python/at line 40 then runs unconditionally, against whatever the working tree happens to contain, and its exit code becomes the subshell's, masking every earlier failure;The
python3 -c "import sglang; print(...)"line on 43 only prints the version, it does not assert anything, andsglang.launch_serverthen runs against the wrong sglang.Step-by-step proof
Assume
SGL_PR_SHA = 18afbf15…and the branchamd/deepseek_v4has since advanced to a different SHAY. The fetch-by-SHA call fails (transient503fromgithub.com):amd/deepseek_v4, no18afbf15…).git fetch --depth=1 origin 18afbf15…→ fails (stderr suppressed via2>/dev/null), exit 128.git fetch --depth=1 origin amd/deepseek_v4→ succeeds, downloads only commitY.git checkout --force 18afbf15…→ fails:fatal: reference is not a tree: 18afbf15…. Withoutset -e, subshell continues; HEAD is still whatever the initial clone left (default branch tip).test "$(git rev-parse HEAD)" = "18afbf15…"→$?set non-zero. Subshell continues.pip install … -e python/→ runs against the default-branch tree (not the pin), succeeds, exit 0.import sglang; print(...)→ prints the wrong version.sglang.launch_serverruns with the wrong sglang and benchmark numbers are reported under the assumed pin.A second realistic trigger: the by-SHA fetch succeeds on first run, then on a later invocation in the same
/tmp/sglang-amd-dsv4checkout the network is flaky. Theif [ ! -d "$SGL_PR_DIR/.git" ]guard at line 25 means the clone is reused; the fallback then leaves the local repo with only commitY; the same silent cascade follows.Fix
Minimum: add
set -eo pipefailnear the top of the script (matching the vllm sister script). That alone causes the failed checkout / mismatched-HEAD test to abort the subshell and the script.Additionally, make the fallback actually able to produce non-tip SHAs — drop
--depth=1from the branch fallback, or use a deeper fetch (e.g.--depth=50) so older pinned SHAs remain reachable.