[SGLang broken] Add MI355X config: glm5-fp4-sglang-mtp #1091
New file `benchmarks/single_node/glm5_fp4_mi355x_mtp.sh` (+78 lines):

```bash
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

# Pre-fetch model weights from the Hugging Face Hub
hf download "$MODEL"

export SGLANG_ENABLE_SPEC_V2=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

python3 -m sglang.launch_server \
    --model-path "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --trust-remote-code \
    --tp "$TP" \
    --chunked-prefill-size 131072 \
    --disable-radix-cache \
    --mem-fraction-static 0.85 \
    --model-loader-extra-config '{"enable_multithread_load": true}' \
    --watchdog-timeout 1200 \
    --reasoning-parser glm45 \
    --tool-call-parser glm47 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    $EVAL_CONTEXT_ARGS > "$SERVER_LOG" 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"
```
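`benchmark_lib.sh` is not part of this diff, so the real behavior of `check_env_vars` is unknown; a plausible sketch of such a helper, offered purely as an assumption about what the library does, might look like:

```shell
# Hypothetical sketch of a check_env_vars helper. benchmark_lib.sh is not
# shown in this PR, so this is an assumption, not the repo's actual code.
check_env_vars() {
    local missing=0
    for var in "$@"; do
        # ${!var} is bash indirect expansion: the value of the variable
        # whose name is stored in $var.
        if [ -z "${!var}" ]; then
            echo "ERROR: required environment variable '$var' is not set" >&2
            missing=1
        fi
    done
    # Fail fast so the benchmark never launches half-configured.
    if [ "$missing" -ne 0 ]; then
        return 1
    fi
}
```

The design point is the same either way: every required variable is reported before the script aborts, instead of failing on the first missing one.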
Review comment on lines +43 to +56 (Contributor):

🔴 The new script never enables SGLang's NSA attention backends.

**What the bug is and how it manifests.** GLM-5 uses Native Sparse Attention (NSA) as a core architectural component. SGLang requires explicit backend flags to enable the correct NSA attention kernel; without them it silently falls back to a standard dense-attention kernel, which implements a fundamentally different computation pattern and will produce incorrect attention outputs or severely degraded throughput/latency numbers.

**The specific code path that triggers it.** The `sglang.launch_server` invocation on lines +43 to +56 passes no NSA backend flags.

**Why existing code doesn't prevent it.** These flags are not validated or defaulted by the framework: their absence simply causes SGLang to select a default (non-NSA) attention backend. No warning or error is emitted at startup. The model will appear to load and run normally while silently computing incorrect attention.

**Concrete proof via comparison.** Every GLM-5 script on every platform specifies NSA backends; this is the single exception.

**Impact.** Benchmark results collected with this script will reflect standard attention performance rather than NSA performance. Since NSA is a key differentiator of GLM-5, the published numbers will be misleading at best and incorrect at worst. Correctness of generated output may also be affected.

**How to fix.** Add `--nsa-prefill-backend tilelang --nsa-decode-backend tilelang` to the `sglang.launch_server` invocation, matching the other GLM-5 scripts.
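A sketch of the proposed fix, using the flag names quoted in this PR's review comments; the placement within the arg list is an assumption, mirrored from how peer scripts are described:

```shell
# Fix sketched by the review: enable SGLang's NSA backends on the existing
# sglang.launch_server invocation. Flag names are taken from the review
# comments; their position among the other flags is an assumption.
python3 -m sglang.launch_server \
    --model-path "$MODEL" \
    --nsa-prefill-backend tilelang \
    --nsa-decode-backend tilelang
# (all other flags from the script above remain unchanged)
```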
```bash
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
```
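For reference, a hypothetical invocation of the script with the environment variables it checks at the top; the model repo id and `RANDOM_RANGE_RATIO` value are illustrative assumptions, not taken from this PR:

```shell
# Hypothetical invocation; the model id and range ratio are assumptions.
MODEL="zai-org/GLM-5-FP4" \
TP=8 \
CONC=64 \
ISL=1024 \
OSL=1024 \
RANDOM_RANGE_RATIO=0.8 \
RESULT_FILENAME="glm5-fp4-mi355x-sglang-mtp.json" \
bash glm5_fp4_mi355x_mtp.sh
```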
Review comment (Contributor):

🔴 The new `glm5_fp4_mi355x_mtp.sh` is missing two configuration pieces that its direct peer `glm5_fp8_mi355x_mtp.sh` sets: (1) `--kv-cache-dtype fp8_e4m3` on the `sglang.launch_server` invocation (lines 33–50), and (2) the three MI355X ROCm/SGLang perf-tuning env vars (`SGLANG_ROCM_FUSED_DECODE_MLA=0`, `ROCM_QUICK_REDUCE_QUANTIZATION=INT4`, `SAFETENSORS_FAST_GPU=1`) around line 20, where only `SGLANG_ENABLE_SPEC_V2=1` is exported. Both are set uniformly across every other GLM-5 / MI355X SGLang script in the repo, so the FP4-MTP benchmark numbers will not be comparable to the FP8-MTP counterpart; recommend mirroring the `glm5_fp8_mi355x_mtp.sh` template for both the env block and the launch-flag list.

**What the bug is.** The newly added `benchmarks/single_node/glm5_fp4_mi355x_mtp.sh` omits two configuration pieces that every peer GLM-5 / MI355X SGLang script in the repo sets. The omissions do not cause a crash; they silently change KV cache memory accounting and ROCm kernel selection, making the FP4-MTP benchmark numbers non-comparable to the FP8-MTP counterpart (`glm5_fp8_mi355x_mtp.sh`), which is the direct template for this config.

**Specific omissions vs. the FP8-MTP peer (`glm5_fp8_mi355x_mtp.sh`).**

1. `--kv-cache-dtype fp8_e4m3`, set at line 54 of the FP8-MTP peer, is absent from the new script's `sglang.launch_server` block at lines 33–50. A grep of `benchmarks/single_node/glm5*.sh` shows every other GLM-5 script (all 11 peers across B200/B300/MI355X × FP4/FP8 × MTP/non-MTP) passes this flag: `glm5_fp8_mi355x.sh:53`, `glm5_fp8_mi355x_mtp.sh:54`, `glm5_fp4_b200.sh:43`, `glm5_fp4_b200_mtp.sh:48`, `glm5_fp4_b300.sh:47`, `glm5_fp4_b300_mtp.sh:52`, `glm5_fp8_b200.sh:47`, `glm5_fp8_b200_mtp.sh:48`, `glm5_fp8_b300.sh:51`, `glm5_fp8_b300_mtp.sh:52`, `glm5.1_fp4_mi355x.sh:54`. The new `glm5_fp4_mi355x_mtp.sh` is the sole exception.

2. The three MI355X env vars, set at lines 26–28 of the FP8-MTP peer under the comment `# ROCm / SGLang performance tuning for MI355X`:
   - `SGLANG_ROCM_FUSED_DECODE_MLA=0`
   - `ROCM_QUICK_REDUCE_QUANTIZATION=INT4`
   - `SAFETENSORS_FAST_GPU=1`

   These three also appear in every other MI355X SGLang single-node script: `dsr1_fp4_mi355x.sh`, `dsr1_fp8_mi355x.sh`, `glm5.1_fp4_mi355x.sh`, `glm5_fp8_mi355x.sh`, `gptoss_fp4_mi355x.sh`, `kimik2.5_fp4_mi355x.sh`, `minimaxm2.5_fp8_mi355x.sh`. The new script only exports `SGLANG_ENABLE_SPEC_V2=1` (line 20).

**Why existing code does not catch this.** SGLang does not error when `--kv-cache-dtype` is unspecified; it silently defaults to the model compute dtype (bf16/fp16 for the attention path). The three env vars are advisory tuning knobs; missing them simply changes kernel selection and I/O behaviour rather than failing startup. So the server will appear to launch fine, and the only symptom is quietly different benchmark numbers.

**Impact (step-by-step proof for `--kv-cache-dtype`).** Take the published config for this PR: `tp: 8, conc-start: 4, conc-end: 64` for both 1k1k and 8k1k. With `--mem-fraction-static 0.85` on MI355X (288 GB HBM3e per GPU, so roughly 2.3 TB total at TP=8), SGLang carves out a fixed KV pool after weights/activations. Per-token KV memory scales linearly with KV cache element size: bf16/fp16 is 2 bytes/element, fp8_e4m3 is 1 byte/element. So for the same pool, with `--kv-cache-dtype fp8_e4m3` N tokens fit, while the bf16 default fits only N/2 tokens, halving KV capacity at identical settings. `glm5-fp4-mi355x-sglang-mtp` will therefore not be meaningfully comparable to `glm5-fp8-mi355x-sglang-mtp`, which is the whole point of adding this config.

**Impact (env vars).** `SAFETENSORS_FAST_GPU=1` speeds up weight load; `ROCM_QUICK_REDUCE_QUANTIZATION=INT4` selects the INT4 reduce path used uniformly on MI355X; `SGLANG_ROCM_FUSED_DECODE_MLA=0` is explicitly disabled across MI355X configs. Whether or not a given flag is a no-op for NSA (which GLM-5 uses instead of MLA), their absence means this script is the only MI355X single-node script with different runtime characteristics than the rest, an inconsistency that will show up in perf-changelog comparisons.

**How to fix.** Mirror the `glm5_fp8_mi355x_mtp.sh` template: export the three MI355X tuning env vars alongside `SGLANG_ENABLE_SPEC_V2=1`, and add `--kv-cache-dtype fp8_e4m3 \` to the `sglang.launch_server` arg list (e.g. after `--mem-fraction-static 0.85` on line 41).

Note: the existing inline review comment on this PR (id 3105939613) flags the missing NSA backend flags (`--nsa-prefill-backend tilelang --nsa-decode-backend tilelang`), which is a separate omission from the FP8-MTP peer. The two findings are distinct and both warrant alignment with `glm5_fp8_mi355x_mtp.sh`.
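The halving argument in the impact section can be checked with back-of-envelope arithmetic. Only the 2:1 bytes-per-element ratio (bf16 vs fp8_e4m3) comes from the comment above; the pool size and per-token KV footprint below are made-up round numbers for illustration:

```shell
# Back-of-envelope check of the KV-capacity claim. The 2 B (bf16) vs
# 1 B (fp8_e4m3) per KV element ratio is from the review; the pool size
# and per-token footprint are illustrative assumptions only.
POOL_BYTES=$(( 100 * 1024 * 1024 * 1024 ))      # assume a 100 GiB KV pool
KV_PER_TOKEN_BF16=$(( 2 * 512 * 1024 ))         # assume 1 MiB/token at 2 B/elem
KV_PER_TOKEN_FP8=$(( KV_PER_TOKEN_BF16 / 2 ))   # 1 B/elem -> half the bytes

TOKENS_BF16=$(( POOL_BYTES / KV_PER_TOKEN_BF16 ))
TOKENS_FP8=$(( POOL_BYTES / KV_PER_TOKEN_FP8 ))
echo "bf16 default: $TOKENS_BF16 tokens fit"    # 102400
echo "fp8_e4m3:     $TOKENS_FP8 tokens fit"     # 204800, exactly 2x
```

Whatever the real pool size and per-token footprint are, the ratio is fixed at 2x, which is why the two MTP configs stop being comparable without the flag.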