18 changes: 18 additions & 0 deletions .github/configs/amd-master.yaml
@@ -319,6 +319,24 @@ glm5-fp8-mi355x-sglang-mtp:
    search-space:
      - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: mtp }

glm5-fp4-mi355x-sglang-mtp:
  image: lmsysorg/sglang-rocm:v0.5.10rc0-rocm700-mi35x-20260422
  model: amd/GLM-5-MXFP4
  model-prefix: glm5
  runner: mi355x
  precision: fp4
  framework: sglang
  multinode: false
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space:
        - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: mtp }
    - isl: 8192
      osl: 1024
      search-space:
        - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: mtp }

glm5-fp8-mi355x-atom:
  image: rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2.post
  model: zai-org/GLM-5-FP8
78 changes: 78 additions & 0 deletions benchmarks/single_node/glm5_fp4_mi355x_mtp.sh
@@ -0,0 +1,78 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

hf download "$MODEL"

export SGLANG_ENABLE_SPEC_V2=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

python3 -m sglang.launch_server \
    --model-path $MODEL \
    --host=0.0.0.0 \
    --port $PORT \
    --trust-remote-code \
    --tp $TP \
    --chunked-prefill-size 131072 \
    --disable-radix-cache \
    --mem-fraction-static 0.85 \
    --model-loader-extra-config '{"enable_multithread_load": true}' \
    --watchdog-timeout 1200 \
    --reasoning-parser glm45 \
    --tool-call-parser glm47 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
Comment on lines +33 to +50
🔴 The new glm5_fp4_mi355x_mtp.sh is missing two configuration pieces that its direct peer glm5_fp8_mi355x_mtp.sh sets: (1) --kv-cache-dtype fp8_e4m3 on the sglang.launch_server invocation (lines 33–50), and (2) the three MI355X ROCm/SGLang perf-tuning env vars (SGLANG_ROCM_FUSED_DECODE_MLA=0, ROCM_QUICK_REDUCE_QUANTIZATION=INT4, SAFETENSORS_FAST_GPU=1) around line 20 where only SGLANG_ENABLE_SPEC_V2=1 is exported. Both are set uniformly across every other GLM-5 / MI355X SGLang script in the repo, so the FP4-MTP benchmark numbers will not be comparable to the FP8-MTP counterpart — recommend mirroring the glm5_fp8_mi355x_mtp.sh template for both the env block and the launch-flag list.

Extended reasoning...

What the bug is

The newly added benchmarks/single_node/glm5_fp4_mi355x_mtp.sh omits two configuration pieces that every peer GLM-5 / MI355X SGLang script in the repo sets. The omissions do not cause a crash — they silently change KV cache memory accounting and ROCm kernel selection, making the FP4-MTP benchmark numbers non-comparable to its FP8-MTP counterpart (glm5_fp8_mi355x_mtp.sh), which is the direct template for this config.

Specific omissions vs. the FP8-MTP peer (glm5_fp8_mi355x_mtp.sh)

  1. --kv-cache-dtype fp8_e4m3 — set at line 54 of the FP8-MTP peer; absent in the new script's sglang.launch_server block at lines 33–50. Grep of benchmarks/single_node/glm5*.sh shows every other GLM-5 script (all 11 peers across B200/B300/MI355X × FP4/FP8 × MTP/non-MTP) passes this flag: glm5_fp8_mi355x.sh:53, glm5_fp8_mi355x_mtp.sh:54, glm5_fp4_b200.sh:43, glm5_fp4_b200_mtp.sh:48, glm5_fp4_b300.sh:47, glm5_fp4_b300_mtp.sh:52, glm5_fp8_b200.sh:47, glm5_fp8_b200_mtp.sh:48, glm5_fp8_b300.sh:51, glm5_fp8_b300_mtp.sh:52, glm5.1_fp4_mi355x.sh:54. The new glm5_fp4_mi355x_mtp.sh is the sole exception.

  2. The three MI355X env vars, set at lines 26–28 of the FP8-MTP peer under the comment # ROCm / SGLang performance tuning for MI355X:

    • SGLANG_ROCM_FUSED_DECODE_MLA=0
    • ROCM_QUICK_REDUCE_QUANTIZATION=INT4
    • SAFETENSORS_FAST_GPU=1
      These three also appear in every other MI355X SGLang single-node script: dsr1_fp4_mi355x.sh, dsr1_fp8_mi355x.sh, glm5.1_fp4_mi355x.sh, glm5_fp8_mi355x.sh, gptoss_fp4_mi355x.sh, kimik2.5_fp4_mi355x.sh, minimaxm2.5_fp8_mi355x.sh. The new script only exports SGLANG_ENABLE_SPEC_V2=1 (line 20).

Why existing code does not catch this

SGLang does not error when --kv-cache-dtype is unspecified — it silently defaults to the model compute dtype (bf16/fp16 for the attention path). The three env vars are advisory tuning knobs; missing them simply changes kernel selection and I/O behaviour rather than failing startup. So the server will appear to launch fine, and the only symptom is quietly different benchmark numbers.

Impact (step-by-step proof for --kv-cache-dtype)

Take the published config for this PR: tp: 8, conc-start: 4, conc-end: 64 for both 1k1k and 8k1k. With --mem-fraction-static 0.85 on MI355X (288 GB HBM3e/GPU, so ~2.3 TB total at TP=8), SGLang carves out a fixed KV pool after weights/activations. Per-token KV memory scales linearly with KV cache element size — bf16/fp16 is 2 bytes/element, fp8_e4m3 is 1 byte/element. So for the same pool:

  1. FP8 MTP peer with --kv-cache-dtype fp8_e4m3: N tokens fit.
  2. New FP4 MTP script with default KV dtype: roughly N/2 tokens fit.
  3. At conc=64 with ISL=8192 + spec-decoding draft tokens, each request needs ≥8k of KV plus overhead. The new script will therefore hit scheduler back-pressure or OOM earlier than the FP8 peer at the same concurrency point — so the reported throughput/latency at conc=64 and max-concurrency behavior will diverge from the FP8 peer purely due to the missing flag, not due to the FP4 weight format. Benchmarks for glm5-fp4-mi355x-sglang-mtp will therefore not be meaningfully comparable to glm5-fp8-mi355x-sglang-mtp, which is the whole point of adding this config.
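The capacity argument above can be sketched numerically. This is a hedged illustration only: the pool size and the layer/head/dim shape below are assumed round numbers for demonstration, not the real GLM-5 or MI355X figures.

```python
# Hedged sketch of the KV-pool capacity argument: halving the KV element
# size (bf16 -> fp8_e4m3) roughly doubles how many tokens a fixed pool holds.
# POOL, LAYERS, KV_HEADS, HEAD_DIM are illustrative assumptions, not the
# actual GLM-5 / MI355X values.

def kv_tokens(pool_bytes: int, layers: int, kv_heads: int, head_dim: int,
              elem_bytes: int) -> int:
    """Tokens fitting in a fixed KV pool: 2 (K and V) * layers * heads * dim * bytes per token."""
    per_token = 2 * layers * kv_heads * head_dim * elem_bytes
    return pool_bytes // per_token

POOL = 200 * 2**30                        # assumed per-GPU KV pool, bytes
LAYERS, KV_HEADS, HEAD_DIM = 92, 8, 128   # hypothetical model shape

bf16_tokens = kv_tokens(POOL, LAYERS, KV_HEADS, HEAD_DIM, 2)  # default dtype
fp8_tokens = kv_tokens(POOL, LAYERS, KV_HEADS, HEAD_DIM, 1)   # fp8_e4m3
print(fp8_tokens // bf16_tokens)  # -> 2: the default-dtype script fits ~half the tokens
```

Whatever the real shapes are, the ratio is fixed by element size alone, which is why the missing flag skews the concurrency sweep rather than the weight-format comparison it is meant to isolate.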

Impact (env vars)

SAFETENSORS_FAST_GPU=1 speeds up weight load; ROCM_QUICK_REDUCE_QUANTIZATION=INT4 selects the INT4 reduce path used uniformly on MI355X; SGLANG_ROCM_FUSED_DECODE_MLA=0 is explicitly disabled across MI355X configs. Even if a given flag turns out to be a no-op for NSA (which GLM-5 uses instead of MLA), the missing exports make this the only MI355X single-node script with different runtime characteristics from the rest — an inconsistency that will show up in perf-changelog comparisons.

How to fix

Mirror the glm5_fp8_mi355x_mtp.sh template:

  • Add around line 20, before SGLANG_ENABLE_SPEC_V2=1:
    # ROCm / SGLang performance tuning for MI355X
    export SGLANG_ROCM_FUSED_DECODE_MLA=0
    export ROCM_QUICK_REDUCE_QUANTIZATION=INT4
    export SAFETENSORS_FAST_GPU=1
  • Add --kv-cache-dtype fp8_e4m3 \ to the sglang.launch_server arg list (e.g. after --mem-fraction-static 0.85 on line 41).

Note: the existing inline review comment on this PR (id 3105939613) flags the missing NSA backend flags (--nsa-prefill-backend tilelang --nsa-decode-backend tilelang), which is a separate omission from the FP8 MTP peer. The two findings are distinct and both warrant alignment with glm5_fp8_mi355x_mtp.sh.


SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

Comment on lines +43 to +56
🔴 The new glm5_fp4_mi355x_mtp.sh launch script is missing the --nsa-prefill-backend tilelang --nsa-decode-backend tilelang flags required by GLM-5's Native Sparse Attention architecture. Every other GLM-5 MI355X script includes these flags; without them SGLang falls back to a standard attention kernel that does not implement NSA correctly, producing silently degraded or incorrect benchmark results.

Extended reasoning...

What the bug is and how it manifests

GLM-5 uses Native Sparse Attention (NSA) as a core architectural component. SGLang requires explicit backend flags to enable the correct NSA attention kernel; without them it silently falls back to a standard dense-attention kernel, which implements a fundamentally different computation pattern and will produce incorrect attention outputs or severely degraded throughput/latency numbers.

The specific code path that triggers it

In benchmarks/single_node/glm5_fp4_mi355x_mtp.sh (lines 37–55), the python3 -m sglang.launch_server invocation includes all the expected GLM-5 flags (reasoning parser, tool-call parser, EAGLE speculative decoding, etc.) but is entirely missing --nsa-prefill-backend tilelang and --nsa-decode-backend tilelang.

Why existing code doesn't prevent it

These flags are not validated or defaulted by the framework – their absence simply causes SGLang to select a default (non-NSA) attention backend. No warning or error is emitted at startup. The model will appear to load and run normally while silently computing incorrect attention.

Concrete proof via comparison

  • glm5_fp8_mi355x.sh (line 51–52): includes --nsa-prefill-backend tilelang --nsa-decode-backend tilelang
  • glm5_fp8_mi355x_mtp.sh (line 52–53): includes --nsa-prefill-backend tilelang --nsa-decode-backend tilelang
  • glm5_fp4_b200_sglang_mtp.sh / B300 variants: include --attention-backend nsa --nsa-decode-backend trtllm --nsa-prefill-backend trtllm (trtllm backend for NVIDIA)
  • glm5_fp4_mi355x_mtp.sh (this PR): no NSA flags at all

Every GLM-5 script on every platform specifies NSA backends; this is the single exception.

Impact

Benchmark results collected with this script will reflect standard attention performance rather than NSA performance. Since NSA is a key differentiator of GLM-5, the published numbers will be misleading at best and incorrect at worst. Correctness of generated output may also be affected.

How to fix

Add --nsa-prefill-backend tilelang --nsa-decode-backend tilelang to the sglang.launch_server invocation, consistent with the existing glm5_fp8_mi355x_mtp.sh script (the direct MI355X FP8 MTP counterpart).
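Because the omission is silent at startup, this class of drift is easy to catch mechanically. A hedged sketch of such a guard (the flag names come from the peer scripts quoted above, not from SGLang's CLI definition; the helper itself is illustrative, not part of the repo's tooling):

```python
# Hedged sketch: flag an arg list that is missing the NSA backend flags
# that every peer GLM-5 script passes. Illustrative helper only.
REQUIRED_NSA_FLAGS = {"--nsa-prefill-backend", "--nsa-decode-backend"}

def missing_nsa_flags(argv: list[str]) -> list[str]:
    """Return the required NSA backend flags absent from argv, sorted."""
    return sorted(REQUIRED_NSA_FLAGS - set(argv))

fixed = ["--tp", "8", "--nsa-prefill-backend", "tilelang",
         "--nsa-decode-backend", "tilelang"]
broken = ["--tp", "8"]  # the shape of this PR's arg list

print(missing_nsa_flags(fixed))   # -> []
print(missing_nsa_flags(broken))  # -> ['--nsa-decode-backend', '--nsa-prefill-backend']
```

Running a check like this over each script's launch invocation in CI would have surfaced both omissions in this PR before any benchmark time was spent.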

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
11 changes: 11 additions & 0 deletions perf-changelog.yaml
@@ -1702,3 +1702,14 @@
  description:
    - "Add VLLM_FLOAT32_MATMUL_PRECISION=high"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1069

- config-keys:
    - glm5-fp4-mi355x-sglang-mtp
  description:
    - "Add GLM-5 MXFP4 MI355X SGLang MTP benchmark"
    - "Image: lmsysorg/sglang-rocm:v0.5.10rc0-rocm700-mi35x-20260422"
    - "Model: amd/GLM-5-MXFP4"
    - "EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) behind SGLANG_ENABLE_SPEC_V2=1"
    - "Image ships transformers with glm_moe_dsa support, so no extra pip install is needed (unlike glm5-fp8-mi355x-sglang)"
    - "Configs: 1k1k and 8k1k, TP=8 conc 4-64 with spec-decoding=mtp"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX