Merged
8 changes: 5 additions & 3 deletions .github/configs/amd-master.yaml
@@ -364,7 +364,7 @@ glm5-fp8-mi355x-sglang:
         - { tp: 8, conc-start: 4, conc-end: 64 }
 
 glm5-fp8-mi355x-sglang-mtp:
-  image: lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260413
+  image: lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260415
   model: zai-org/GLM-5-FP8
   model-prefix: glm5
   runner: mi355x
@@ -375,11 +375,13 @@ glm5-fp8-mi355x-sglang-mtp:
     - isl: 1024
       osl: 1024
       search-space:
-        - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: mtp }
+        - { tp: 4, conc-start: 4, conc-end: 128, spec-decoding: mtp }
+        - { tp: 8, conc-start: 4, conc-end: 8, spec-decoding: mtp }
     - isl: 8192
       osl: 1024
       search-space:
-        - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: mtp }
+        - { tp: 4, conc-start: 4, conc-end: 128, spec-decoding: mtp }
+        - { tp: 8, conc-start: 4, conc-end: 8, spec-decoding: mtp }
 
 glm5-fp8-mi355x-atom:
   image: rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post
12 changes: 6 additions & 6 deletions benchmarks/single_node/glm5_fp8_mi355x_mtp.sh
@@ -1,4 +1,5 @@
 #!/usr/bin/env bash
+set -x
 
 source "$(dirname "$0")/../benchmark_lib.sh"
 
@@ -15,11 +16,6 @@ if [[ -n "$SLURM_JOB_ID" ]]; then
     echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
 fi
 
-# GLM-5 requires transformers with glm_moe_dsa model type support.
-# However, the Image rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260219 doesn't provide this support.
-python3 -m pip install -U --no-cache-dir \
-    "git+https://github.com/huggingface/transformers.git@6ed9ee36f608fd145168377345bfc4a5de12e1e2"
-
 hf download "$MODEL"
 
 # ROCm / SGLang performance tuning for MI355X
@@ -30,6 +26,7 @@ export SGLANG_ENABLE_SPEC_V2=1
 
 SERVER_LOG=/workspace/server.log
 PORT=${PORT:-8888}
+CONTEXT_LENGTH=$((ISL + OSL + 32))
 
 EVAL_CONTEXT_ARGS=""
 if [ "${EVAL_ONLY}" = "true" ]; then
@@ -45,9 +42,11 @@ python3 -m sglang.launch_server \
     --port $PORT \
     --tensor-parallel-size $TP \
     --trust-remote-code \
+    --cuda-graph-max-bs $CONC \
+    --context-length $CONTEXT_LENGTH \
+    --mem-fraction-static 0.85 \
     --tool-call-parser glm47 \
     --reasoning-parser glm45 \
-    --mem-fraction-static 0.85 \
     --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
     --nsa-prefill-backend tilelang \
     --nsa-decode-backend tilelang $EVAL_CONTEXT_ARGS \
Comment on lines 42 to 52
🔴 When EVAL_ONLY=true, sglang.launch_server now receives --context-length twice: the new unconditional --context-length $CONTEXT_LENGTH (= ISL+OSL+32) at line 43, plus --context-length $EVAL_MAX_MODEL_LEN from $EVAL_CONTEXT_ARGS at line 49. The sibling qwen3.5_bf16_mi355x_mtp.sh makes the two mutually exclusive via an else branch (EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"); please follow that pattern so the eval value isn't quietly relying on argparse's last-wins semantics.

Extended reasoning...

What the bug is

After this PR, benchmarks/single_node/glm5_fp8_mi355x_mtp.sh defines CONTEXT_LENGTH=$((ISL + OSL + 32)) (line 28) and unconditionally passes --context-length $CONTEXT_LENGTH to sglang.launch_server (line 43). However, the pre-existing EVAL_CONTEXT_ARGS block (lines 30-33) still appends --context-length $EVAL_MAX_MODEL_LEN when EVAL_ONLY=true, and that variable is still expanded on the launch line (line 49 in the new file). So in eval mode the launch command contains two --context-length flags with different values.

Code path that triggers it

When the harness sets EVAL_ONLY=true:

  1. Line 28: CONTEXT_LENGTH = ISL + OSL + 32 (e.g. 1024+1024+32 = 2080).
  2. Lines 30-33: setup_eval_context runs and EVAL_MAX_MODEL_LEN is computed (typically ISL+OSL+256 plus a per-model floor inside compute_eval_context_length; see the sketch after this list), and EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN".
  3. Line 43: passes --context-length 2080.
  4. Line 49: expands $EVAL_CONTEXT_ARGS, adding --context-length $EVAL_MAX_MODEL_LEN.
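
For readers without benchmark_lib.sh open, here is a minimal hypothetical sketch of the two helpers named in step 2, reconstructed only from the behavior described there; the function names and the ISL+OSL+256 margin come from this comment, while the 2304 floor is purely illustrative:

# Hypothetical reconstruction of the benchmark_lib.sh helpers named above;
# only the ISL+OSL+256 margin is stated in this comment, the floor is a stand-in.
compute_eval_context_length() {
    local want=$((ISL + OSL + 256))   # eval margin on top of the workload
    local floor=2304                  # illustrative per-model minimum
    (( want > floor )) && echo "$want" || echo "$floor"
}

setup_eval_context() {
    EVAL_MAX_MODEL_LEN=$(compute_eval_context_length)
}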

Why existing code doesn't prevent it

The EVAL_CONTEXT_ARGS indirection was originally designed to be the only place --context-length is set, with throughput mode leaving the flag off (defaulting to the model's max). The new unconditional flag broke that contract. The companion qwen3.5_bf16_mi355x_mtp.sh:26-31 shows the canonical pattern — an else branch sets EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH" and the launch line never repeats --context-length.

Step-by-step proof (ISL=1024, OSL=1024, EVAL_ONLY=true)

  1. CONTEXT_LENGTH = 1024 + 1024 + 32 = 2080.
  2. setup_eval_context sets EVAL_MAX_MODEL_LEN to e.g. ISL+OSL+256 = 2304 (or larger with the GLM floor).
  3. EVAL_CONTEXT_ARGS="--context-length 2304".
  4. Effective command:
    python3 -m sglang.launch_server ... --context-length 2080 ... --context-length 2304 ...
    
  5. Two --context-length arguments are present.

Impact

Python argparse takes the last occurrence, so at runtime the eval value (2304) currently wins — meaning the script happens to work today. But the duplicate flag is fragile (any reordering of the launch line, or a future SGLang argparse change to detect duplicate flags, would silently flip which value is used) and inconsistent with every other script in the directory using this idiom. It is also actively confusing in server logs.
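
A minimal repro of that last-wins behavior, assuming SGLang's CLI registers the flag as a plain argparse store action (nothing below is SGLang-specific):

python3 - <<'EOF'
import argparse

# Stock argparse: a repeated optional argument is simply re-stored,
# so the last occurrence on the command line wins.
parser = argparse.ArgumentParser()
parser.add_argument("--context-length", type=int)
args = parser.parse_args(["--context-length", "2080", "--context-length", "2304"])
print(args.context_length)  # -> 2304; the eval value overrides the unconditional one
EOF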

How to fix

Mirror the qwen3.5 mtp pattern: add an else branch and drop the unconditional flag, e.g.

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else
    EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
fi

and remove the unconditional --context-length $CONTEXT_LENGTH from the launch invocation.

@@ -56,6 +55,7 @@ python3 -m sglang.launch_server \
     --speculative-num-steps 3 \
     --speculative-eagle-topk 1 \
     --speculative-num-draft-tokens 4 \
+    --tokenizer-worker-num $((TP*2)) \
     --disable-radix-cache> $SERVER_LOG 2>&1 &
 
 SERVER_PID=$!
9 changes: 9 additions & 0 deletions perf-changelog.yaml
@@ -2069,3 +2069,12 @@
       - "Recipes cover 8k/1k aggregate TP8 low-latency conc=1, low-latency bridge 1P DEP8 + 4D TP8 no-offload conc=16/32/64, mid 1P/1D DEP8 MegaMOE conc=128, and high-throughput 2P/1D DEP8 MegaMOE conc=1024"
       - "All recipes enable FP4 indexer cache and speculative-config mtp with num_speculative_tokens=2"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1242
+
+- config-keys:
+    - glm5-fp8-mi355x-sglang-mtp
+  description:
+    - "Updated the image for glm5-fp8-mi355x-sglang-mtp"
+    - "Optimized the search space"
+    - "Removed the redundant transformers installation"
+    - "Optimized the model-serving configuration"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1252