[AMD][ROCM] GLM5/5.1 FP8 MTP Image Update and Search Space Optimization #1252

chunfangamd merged 2 commits into main
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```shell
    --port $PORT \
    --tensor-parallel-size $TP \
    --trust-remote-code \
    --cuda-graph-max-bs $CONC \
    --context-length $CONTEXT_LENGTH \
    --mem-fraction-static 0.85 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --mem-fraction-static 0.85 \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
    --nsa-prefill-backend tilelang \
    --nsa-decode-backend tilelang $EVAL_CONTEXT_ARGS \
```
🔴 When EVAL_ONLY=true, sglang.launch_server now receives --context-length twice: the new unconditional --context-length $CONTEXT_LENGTH (= ISL+OSL+32) at line 43, plus --context-length $EVAL_MAX_MODEL_LEN from $EVAL_CONTEXT_ARGS at line 49. The sibling qwen3.5_bf16_mi355x_mtp.sh mutually-excludes the two via an else branch (EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH") — please follow that pattern so the eval value isn't quietly relying on argparse last-wins semantics.
Extended reasoning...
**What the bug is**
After this PR, `benchmarks/single_node/glm5_fp8_mi355x_mtp.sh` defines `CONTEXT_LENGTH=$((ISL + OSL + 32))` (line 28) and unconditionally passes `--context-length $CONTEXT_LENGTH` to `sglang.launch_server` (line 43). However, the pre-existing `EVAL_CONTEXT_ARGS` block (lines 30-33) still appends `--context-length $EVAL_MAX_MODEL_LEN` when `EVAL_ONLY=true`, and that variable is still expanded on the launch line (line 49 in the new file). So in eval mode the launch command contains two `--context-length` flags with different values.
**Code path that triggers it**

When the harness sets `EVAL_ONLY=true`:
- Line 28: `CONTEXT_LENGTH = ISL + OSL + 32` (e.g. 1024+1024+32 = 2080).
- Lines 30-33: `setup_eval_context` runs and `EVAL_MAX_MODEL_LEN` is computed (typically ISL+OSL+256 plus a per-model floor inside `compute_eval_context_length`), and `EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"`.
- Line 43: passes `--context-length 2080`.
- Line 49: expands `$EVAL_CONTEXT_ARGS`, adding `--context-length $EVAL_MAX_MODEL_LEN`.
**Why existing code doesn't prevent it**
The `EVAL_CONTEXT_ARGS` indirection was originally designed to be the only place `--context-length` is set, with throughput mode leaving the flag off (defaulting to the model's max). The new unconditional flag broke that contract. The companion `qwen3.5_bf16_mi355x_mtp.sh:26-31` shows the canonical pattern: an else branch sets `EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"`, and the launch line never repeats `--context-length`.
**Step-by-step proof (ISL=1024, OSL=1024, EVAL_ONLY=true)**

- `CONTEXT_LENGTH = 1024 + 1024 + 32 = 2080`.
- `setup_eval_context` sets `EVAL_MAX_MODEL_LEN` to e.g. ISL+OSL+256 = 2304 (or larger with the GLM floor).
- `EVAL_CONTEXT_ARGS="--context-length 2304"`.
- Effective command: `python3 -m sglang.launch_server ... --context-length 2080 ... --context-length 2304 ...`
- Two `--context-length` arguments are present.
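The arithmetic in the proof can be reproduced with a stand-alone sketch (the values are the example numbers from this proof; the per-model floor from `compute_eval_context_length` is ignored, and the unquoted expansion mirrors how the script expands `$EVAL_CONTEXT_ARGS`):

```shell
ISL=1024
OSL=1024
CONTEXT_LENGTH=$((ISL + OSL + 32))        # 2080, the new unconditional value
EVAL_MAX_MODEL_LEN=$((ISL + OSL + 256))   # 2304, ignoring the per-model floor
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"

# The launch line ends up passing both flags:
echo python3 -m sglang.launch_server \
  --context-length "$CONTEXT_LENGTH" \
  $EVAL_CONTEXT_ARGS
```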
**Impact**
Python argparse takes the last occurrence, so at runtime the eval value (2304) currently wins, meaning the script happens to work today. But the duplicate flag is fragile (any reordering of the launch line, or a future SGLang change to detect duplicate flags, would silently flip which value is used), inconsistent with every other script in the directory that uses this idiom, and actively confusing in server logs.
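The last-wins behavior is easy to confirm in isolation (a minimal sketch using a throwaway Python parser, not SGLang's actual argument definitions):

```shell
# Feed the same flag twice to a throwaway argparse parser and
# capture which value survives.
winner=$(python3 - <<'EOF'
import argparse
p = argparse.ArgumentParser()
p.add_argument("--context-length", type=int)
args = p.parse_args(["--context-length", "2080", "--context-length", "2304"])
print(args.context_length)
EOF
)
echo "$winner"   # 2304: the last occurrence wins
```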
**How to fix**
Mirror the qwen3.5 mtp pattern: add an else branch and drop the unconditional flag, e.g.
```shell
EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
  setup_eval_context
  EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else
  EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
fi
```

and remove the unconditional `--context-length $CONTEXT_LENGTH` from the launch invocation.
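As a sanity check, the if/else pattern can be exercised stand-alone (hypothetical values; `setup_eval_context` is stubbed out for the sketch), confirming that each mode emits exactly one `--context-length`:

```shell
# Stub and hypothetical values -- not taken from the real script.
setup_eval_context() { EVAL_MAX_MODEL_LEN=2304; }
CONTEXT_LENGTH=2080

for EVAL_ONLY in true false; do
  EVAL_CONTEXT_ARGS=""
  if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
  else
    EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
  fi
  echo "EVAL_ONLY=$EVAL_ONLY -> $EVAL_CONTEXT_ARGS"
done
```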
See the unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25211851871
Author: @ajith-sirra-amd