
[AMD][ROCM] GLM5/5.1 FP8 MTP Image Update and Search Space Optimization#1252

Merged
chunfangamd merged 2 commits into main from chun-ajith/glm5-5.1-gp8-mtp on May 1, 2026

Conversation

@chunfangamd
Collaborator

Author: @ajith-sirra-amd

  • Updated the image for glm5-fp8-mi355x-sglang-mtp
  • Optimized the search space
  • Removed the redundant transformers installation
  • Optimized the model serving configs

@github-actions
Contributor

github-actions Bot commented May 1, 2026

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Often, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@chunfangamd chunfangamd enabled auto-merge (squash) May 1, 2026 10:55
Comment on lines 42 to 52
--port $PORT \
--tensor-parallel-size $TP \
--trust-remote-code \
--cuda-graph-max-bs $CONC \
--context-length $CONTEXT_LENGTH \
--mem-fraction-static 0.85 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
--nsa-prefill-backend tilelang \
--nsa-decode-backend tilelang $EVAL_CONTEXT_ARGS \

🔴 When EVAL_ONLY=true, sglang.launch_server now receives --context-length twice: the new unconditional --context-length $CONTEXT_LENGTH (= ISL+OSL+32) at line 43, plus --context-length $EVAL_MAX_MODEL_LEN from $EVAL_CONTEXT_ARGS at line 49. The sibling qwen3.5_bf16_mi355x_mtp.sh makes the two mutually exclusive via an else branch (EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"); please follow that pattern so the eval value isn't quietly relying on argparse last-wins semantics.

Extended reasoning...

What the bug is

After this PR, benchmarks/single_node/glm5_fp8_mi355x_mtp.sh defines CONTEXT_LENGTH=$((ISL + OSL + 32)) (line 28) and unconditionally passes --context-length $CONTEXT_LENGTH to sglang.launch_server (line 43). However, the pre-existing EVAL_CONTEXT_ARGS block (lines 30-33) still appends --context-length $EVAL_MAX_MODEL_LEN when EVAL_ONLY=true, and that variable is still expanded on the launch line (line 49 in the new file). So in eval mode the launch command contains two --context-length flags with different values.

Code path that triggers it

When the harness sets EVAL_ONLY=true:

  1. Line 28: CONTEXT_LENGTH = ISL + OSL + 32 (e.g. 1024+1024+32 = 2080).
  2. Lines 30-33: setup_eval_context runs and EVAL_MAX_MODEL_LEN is computed (typically ISL+OSL+256 plus a per-model floor inside compute_eval_context_length), and EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN".
  3. Line 43: passes --context-length 2080.
  4. Line 49: expands $EVAL_CONTEXT_ARGS, adding --context-length $EVAL_MAX_MODEL_LEN (see the dry-run sketch below).
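
For illustration, a minimal dry-run sketch of the expansion (the 2304 stands in for whatever compute_eval_context_length actually returns; all other launch flags are elided):

ISL=1024; OSL=1024
CONTEXT_LENGTH=$((ISL + OSL + 32))        # 2080, set unconditionally
EVAL_MAX_MODEL_LEN=$((ISL + OSL + 256))   # 2304, illustrative eval value
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"

# echo instead of exec so the expanded argv is visible; EVAL_CONTEXT_ARGS
# is left unquoted on purpose so it word-splits, as in the script
echo python3 -m sglang.launch_server \
    --context-length "$CONTEXT_LENGTH" \
    $EVAL_CONTEXT_ARGS
# prints: python3 -m sglang.launch_server --context-length 2080 --context-length 2304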

Why existing code doesn't prevent it

The EVAL_CONTEXT_ARGS indirection was originally designed to be the only place --context-length is set, with throughput mode leaving the flag off (defaulting to the model's max). The new unconditional flag broke that contract. The companion qwen3.5_bf16_mi355x_mtp.sh:26-31 shows the canonical pattern — an else branch sets EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH" and the launch line never repeats --context-length.

Step-by-step proof (ISL=1024, OSL=1024, EVAL_ONLY=true)

  1. CONTEXT_LENGTH = 1024 + 1024 + 32 = 2080.
  2. setup_eval_context sets EVAL_MAX_MODEL_LEN to e.g. ISL+OSL+256 = 2304 (or larger with the GLM floor).
  3. EVAL_CONTEXT_ARGS="--context-length 2304".
  4. Effective command:
    python3 -m sglang.launch_server ... --context-length 2080 ... --context-length 2304 ...
    
  5. Two --context-length arguments are present.

Impact

Python argparse takes the last occurrence, so at runtime the eval value (2304) currently wins — meaning the script happens to work today. But the duplicate flag is fragile (any reordering of the launch line, or a future SGLang argparse change to detect duplicate flags, would silently flip which value is used) and inconsistent with every other script in the directory using this idiom. It is also actively confusing in server logs.
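
That last-occurrence behavior is easy to confirm in isolation. A minimal sketch using plain argparse (an illustration of the Python semantics, not of SGLang's actual parser):

python3 - <<'EOF'
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--context-length", type=int)

# Simulate the duplicated flags from the eval-mode launch line.
args = parser.parse_args(["--context-length", "2080",
                          "--context-length", "2304"])
print(args.context_length)  # 2304: the last occurrence wins
EOF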

How to fix

Mirror the qwen3.5 mtp pattern: add an else branch and drop the unconditional flag, e.g.

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    # eval mode: use the harness-computed context length
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else
    # throughput mode: use the benchmark-derived ISL+OSL+32 value
    EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
fi

and remove the unconditional --context-length $CONTEXT_LENGTH from the launch invocation.
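
With that change, both modes funnel through the same variable and the launch line carries exactly one --context-length. A sketch of the resulting invocation (flags abbreviated; $EVAL_CONTEXT_ARGS is left unquoted so it word-splits into flag and value):

python3 -m sglang.launch_server \
    --port "$PORT" \
    --tensor-parallel-size "$TP" \
    --trust-remote-code \
    $EVAL_CONTEXT_ARGS   # single --context-length, from whichever branch ran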


@functionstackx functionstackx left a comment


lgtm

@chunfangamd chunfangamd merged commit 2d0d428 into main May 1, 2026
38 checks passed
@chunfangamd chunfangamd deleted the chun-ajith/glm5-5.1-gp8-mtp branch May 1, 2026 15:38