[AMD][ROCM] GLM5/5.1 FP8 MTP Image Update and Search Space Optimization #1252

chunfangamd merged 2 commits into main
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```shell
    --port $PORT \
    --tensor-parallel-size $TP \
    --trust-remote-code \
    --cuda-graph-max-bs $CONC \
    --context-length $CONTEXT_LENGTH \
    --mem-fraction-static 0.85 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --mem-fraction-static 0.85 \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
    --nsa-prefill-backend tilelang \
    --nsa-decode-backend tilelang $EVAL_CONTEXT_ARGS \
```
🔴 When EVAL_ONLY=true, sglang.launch_server now receives --context-length twice: the new unconditional --context-length $CONTEXT_LENGTH (= ISL+OSL+32) at line 43, plus --context-length $EVAL_MAX_MODEL_LEN from $EVAL_CONTEXT_ARGS at line 49. The sibling qwen3.5_bf16_mi355x_mtp.sh mutually-excludes the two via an else branch (EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH") — please follow that pattern so the eval value isn't quietly relying on argparse last-wins semantics.
Extended reasoning...
**What the bug is**
After this PR, `benchmarks/single_node/glm5_fp8_mi355x_mtp.sh` defines `CONTEXT_LENGTH=$((ISL + OSL + 32))` (line 28) and unconditionally passes `--context-length $CONTEXT_LENGTH` to `sglang.launch_server` (line 43). However, the pre-existing `EVAL_CONTEXT_ARGS` block (lines 30-33) still appends `--context-length $EVAL_MAX_MODEL_LEN` when `EVAL_ONLY=true`, and that variable is still expanded on the launch line (line 49 in the new file). So in eval mode the launch command contains two `--context-length` flags with different values.
**Code path that triggers it**

When the harness sets `EVAL_ONLY=true`:
- Line 28: `CONTEXT_LENGTH = ISL + OSL + 32` (e.g. 1024+1024+32 = 2080).
- Lines 30-33: `setup_eval_context` runs and `EVAL_MAX_MODEL_LEN` is computed (typically ISL+OSL+256 plus a per-model floor inside `compute_eval_context_length`), and `EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"`.
- Line 43: passes `--context-length 2080`.
- Line 49: expands `$EVAL_CONTEXT_ARGS`, adding `--context-length $EVAL_MAX_MODEL_LEN`.
**Why existing code doesn't prevent it**
The `EVAL_CONTEXT_ARGS` indirection was originally designed to be the only place `--context-length` is set, with throughput mode leaving the flag off (defaulting to the model's max). The new unconditional flag broke that contract. The companion `qwen3.5_bf16_mi355x_mtp.sh:26-31` shows the canonical pattern: an else branch sets `EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"`, and the launch line never repeats `--context-length`.
**Step-by-step proof (ISL=1024, OSL=1024, EVAL_ONLY=true)**

- `CONTEXT_LENGTH = 1024 + 1024 + 32 = 2080`.
- `setup_eval_context` sets `EVAL_MAX_MODEL_LEN` to e.g. ISL+OSL+256 = 2304 (or larger with the GLM floor).
- `EVAL_CONTEXT_ARGS="--context-length 2304"`.
- Effective command: `python3 -m sglang.launch_server ... --context-length 2080 ... --context-length 2304 ...`
- Two `--context-length` arguments are present.
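The arithmetic in the proof can be reproduced with a stand-alone sketch (the values are the example numbers from this proof; the per-model floor from `compute_eval_context_length` is ignored, and the unquoted expansion mirrors how the script expands `$EVAL_CONTEXT_ARGS`):

```shell
ISL=1024
OSL=1024
CONTEXT_LENGTH=$((ISL + OSL + 32))        # 2080, the new unconditional value
EVAL_MAX_MODEL_LEN=$((ISL + OSL + 256))   # 2304, ignoring the per-model floor
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"

# The launch line ends up passing both flags:
echo python3 -m sglang.launch_server \
  --context-length "$CONTEXT_LENGTH" \
  $EVAL_CONTEXT_ARGS
```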
**Impact**
Python argparse takes the last occurrence, so at runtime the eval value (2304) currently wins, meaning the script happens to work today. But the duplicate flag is fragile (any reordering of the launch line, or a future SGLang change to detect duplicate flags, would silently flip which value is used), inconsistent with every other script in the directory that uses this idiom, and actively confusing in server logs.
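The last-wins behavior is easy to confirm in isolation (a minimal sketch using a throwaway Python parser, not SGLang's actual argument definitions):

```shell
# Feed the same flag twice to a throwaway argparse parser and
# capture which value survives.
winner=$(python3 - <<'EOF'
import argparse
p = argparse.ArgumentParser()
p.add_argument("--context-length", type=int)
args = p.parse_args(["--context-length", "2080", "--context-length", "2304"])
print(args.context_length)
EOF
)
echo "$winner"   # 2304: the last occurrence wins
```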
**How to fix**
Mirror the qwen3.5 mtp pattern: add an else branch and drop the unconditional flag, e.g.
```shell
EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
  setup_eval_context
  EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else
  EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
fi
```

and remove the unconditional `--context-length $CONTEXT_LENGTH` from the launch invocation.
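As a sanity check, the if/else pattern can be exercised stand-alone (hypothetical values; `setup_eval_context` is stubbed out for the sketch), confirming that each mode emits exactly one `--context-length`:

```shell
# Stub and hypothetical values -- not taken from the real script.
setup_eval_context() { EVAL_MAX_MODEL_LEN=2304; }
CONTEXT_LENGTH=2080

for EVAL_ONLY in true false; do
  EVAL_CONTEXT_ARGS=""
  if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
  else
    EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
  fi
  echo "EVAL_ONLY=$EVAL_ONLY -> $EVAL_CONTEXT_ARGS"
done
```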
See the unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25211851871
Author: @ajith-sirra-amd