Add B300 config: glm5-fp8-sglang #1051
`benchmarks/single_node/glm5_fp8_b300.sh` (new file, `@@ -0,0 +1,89 @@`):

```bash
#!/usr/bin/env bash

# NOTE: At the time of submission, https://cookbook.sglang.io/autoregressive/GLM/GLM-5.1
# does not have a B300-specific recipe, so this script reuses the existing
# GLM5 FP8 B200 SGLang recipe as-is until B300-specific tuning is available.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
  MODEL \
  TP \
  CONC \
  ISL \
  OSL \
  RANDOM_RANGE_RATIO \
  RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
  echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

pip install --no-deps "transformers==5.2.0" "huggingface-hub==1.4.1"

export SGL_ENABLE_JIT_DEEPGEMM=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

echo "CONC: $CONC, ISL: $ISL, OSL: $OSL"

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
  setup_eval_context
  EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
  --trust-remote-code \
  --tensor-parallel-size=$TP \
  --data-parallel-size 1 --expert-parallel-size 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --kv-cache-dtype fp8_e4m3 --quantization fp8 \
  --attention-backend nsa \
  --nsa-decode-backend trtllm --nsa-prefill-backend trtllm \
  --moe-runner-backend flashinfer_trtllm \
  --cuda-graph-max-bs $CONC --max-running-requests $CONC \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 32768 --max-prefill-tokens 32768 \
  --enable-flashinfer-allreduce-fusion --disable-radix-cache \
  --stream-interval 30 \
  --model-loader-extra-config '{"enable_multithread_load": true}' $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
  --model "$MODEL" \
  --port "$PORT" \
  --backend vllm \
  --input-len "$ISL" \
  --output-len "$OSL" \
  --random-range-ratio "$RANDOM_RANGE_RATIO" \
  --num-prompts "$((CONC * 10))" \
  --max-concurrency "$CONC" \
  --result-filename "$RESULT_FILENAME" \
  --result-dir /workspace/

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
  run_eval --framework lm-eval --port "$PORT"
  append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
```
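The script leans on helpers sourced from `benchmark_lib.sh` (`check_env_vars`, `wait_for_server_ready`, `run_benchmark_serving`, etc.) whose implementations are not part of this diff. For readers unfamiliar with the pattern, here is a minimal sketch of what a `check_env_vars`-style helper could look like; only the function name and call site come from the script, the body is an assumption:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a check_env_vars helper (the real one lives in
# benchmark_lib.sh and is not shown in this PR): fail fast when any of
# the named environment variables is unset or empty.
check_env_vars() {
  local missing=0 var
  for var in "$@"; do
    # ${!var} is bash indirect expansion: the value of the variable
    # whose name is stored in $var.
    if [[ -z "${!var:-}" ]]; then
      echo "ERROR: required env var $var is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example mirroring the script's call site:
export MODEL=zai-org/GLM-5-FP8 TP=8 CONC=64 ISL=1024 OSL=1024 \
       RANDOM_RANGE_RATIO=0.8 RESULT_FILENAME=glm5_b300.json
check_env_vars MODEL TP CONC ISL OSL RANDOM_RANGE_RATIO RESULT_FILENAME
```

Failing fast here is what makes the later unquoted uses of `$TP` and `$CONC` in the server launch tolerable: by that point the variables are known to be non-empty.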
`perf-changelog.yaml` (`@@ -1412,3 +1412,11 @@`), appending a new entry after the existing DSR1 B300 entry:

```yaml
    - "Image: lmsysorg/sglang:v0.5.10.post1-cu130"
    - "At the time of submission, https://cookbook.sglang.io/autoregressive/DeepSeek/DeepSeek-R1 does not have a B300-specific recipe, so this reuses the existing DSR1 FP8 B200 SGLang recipe as-is"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1050

- config-keys:
    - glm5-fp8-b300-sglang
  description:
    - "Add GLM-5 FP8 B300 SGLang benchmark"
    - "Image: lmsysorg/sglang:v0.5.10.post1-cu130"
    - "At the time of submission, https://cookbook.sglang.io/autoregressive/GLM/GLM-5.1 does not have a B300-specific recipe, so this reuses the existing GLM5 FP8 B200 SGLang recipe as-is"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1051
```
**Contributor**

🟡 The `glm5-fp8-b300-sglang` entry in `perf-changelog.yaml` has a placeholder PR link (`pull/XXXX`) instead of the actual PR number 1051. It should be updated to #1051.

Extended reasoning: The `perf-changelog.yaml` entry for `glm5-fp8-b300-sglang` was committed with an unresolved placeholder in its `pr-link` field. The current HEAD of the file contains `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX` at line 1398, while the PR diff shows the intended value was `pull/1051`.

The specific code path is straightforward: commit d6e32c3 (the PR #1051 merge commit) introduced the `glm5-fp8-b300-sglang` entry to `perf-changelog.yaml`, but the author never replaced the XXXX placeholder before merging. The diff clearly shows `pull/1051` as the intended value, yet the committed content still has XXXX. Nothing in the codebase prevents placeholder values from being committed; there is no pre-commit validation or CI check that would catch an XXXX in a `pr-link` field. This explains how it slipped through. Other entries in the same file with XXX or XXXX placeholders confirm this is a recurring human error (e.g. `glm5-fp8-mi355x-sglang`, `minimaxm2.5-fp8-h200-vllm`).

The impact is limited to documentation/metadata: anyone reading the changelog or trying to trace the history of this benchmark config would find a broken link. The placeholder XXXX does not affect benchmark execution, configuration parsing, or any runtime behavior. The fix is a one-line change: replace `pull/XXXX` with `pull/1051` on line 1398 of `perf-changelog.yaml`.

Step-by-step proof:

1. The PR diff shows the new entry ending with `pr-link: #1051`.
2. Reading the actual file at HEAD shows the last line is `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX`.
3. Running `git show d6e32c3 -- perf-changelog.yaml` confirms the committed content has XXXX.
4. The immediately preceding entry (`qwen3.5-fp8-b300-sglang-mtp`, PR #1035) correctly references its PR number, confirming the XXXX in the glm5 entry is an oversight, not intentional.
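Since the comment notes that no pre-commit validation or CI check catches placeholder links, a lint along the following lines could close that gap. This is a hypothetical sketch, not code from the repo; the file name `perf-changelog.yaml` and the `pull/XXXX` placeholder pattern are taken from the comment above:

```shell
#!/usr/bin/env bash
# Hypothetical pre-commit lint (assumption: no such check exists in the
# repo today): reject a changelog file whose pr-link still carries a
# pull/XXX... placeholder instead of a real PR number.
check_pr_links() {
  local file=$1
  if grep -qE 'pr-link:.*pull/X{3,}' "$file"; then
    echo "ERROR: placeholder pr-link found in $file" >&2
    return 1
  fi
}
```

Wired into CI as `check_pr_links perf-changelog.yaml || exit 1`, this would have blocked the merge until the XXXX was replaced.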
🔴 The new script calls `hf download "$MODEL"` (line 24), but on B300 the runner overrides `MODEL` to a local filesystem path (`/scratch/models/GLM-5-FP8`), which is not a valid HuggingFace repo ID, causing `hf download` to fail. Remove line 24; models are pre-staged on B300, as confirmed by `qwen3.5_fp8_b300_mtp.sh`, which correctly omits this call.

Extended reasoning:

**What the bug is and how it manifests.** The new benchmark script `benchmarks/single_node/glm5_fp8_b300.sh` (line 24) calls `hf download "$MODEL"`. The `hf` CLI's `download` subcommand expects a HuggingFace repository identifier in `owner/repo` format (e.g. `zai-org/GLM-5-FP8`). Passing a local filesystem path instead causes the command to exit with an error.

**The specific code path that triggers it.** In `runners/launch_b300-nv.sh` (single-node branch, line 220), the runner transforms the model identifier before invoking the benchmark script: `export MODEL="/scratch/models/${MODEL#*/}"`. So the original config value `zai-org/GLM-5-FP8` becomes `/scratch/models/GLM-5-FP8`. The benchmark script then executes `hf download "/scratch/models/GLM-5-FP8"`, which is not a valid repo ID.

**Why existing code doesn't prevent it.** There is no `set -e` before line 24, so the script continues execution after `hf download` fails. The SGLang server is then started with `--model-path=$MODEL`, which correctly points to the pre-staged local path, so the benchmark itself still runs. This masks the bug during casual observation but leaves a broken command and spurious error output in every run's logs.

**What the impact would be.** Every B300 run of this config will produce an error from `hf download` in the logs. If the B300 environment ever changes so that `/scratch/models/` is not pre-populated (e.g. a new node or a CI dry-run), the benchmark would fail to start because the model would be absent and the server launch would fail. The spurious error also makes log triage harder for operators.

**How to fix it.** Remove line 24 (`hf download "$MODEL"`) from `benchmarks/single_node/glm5_fp8_b300.sh`. Models are pre-staged at `/scratch/models/` on B300, so no download step is needed. This matches the pattern of the existing B300 SGLang single-node script `benchmarks/single_node/qwen3.5_fp8_b300_mtp.sh`, which has no `hf download` call.

**Step-by-step proof.**

1. The config sets `model: zai-org/GLM-5-FP8` and `runner: b300`.
2. The `launch_b300-nv.sh` single-node branch (line 220) executes `export MODEL="/scratch/models/${MODEL#*/}"`, so `MODEL` becomes `/scratch/models/GLM-5-FP8`.
3. The benchmark script is invoked with the overridden `MODEL`.
4. `glm5_fp8_b300.sh` line 24 executes `hf download "/scratch/models/GLM-5-FP8"`.
5. `hf download` fails because `/scratch/models/GLM-5-FP8` is not an `org/repo` identifier.
6. With no `set -e` at this point, execution continues to the SGLang server launch, which uses `--model-path=$MODEL` (the pre-staged path) and succeeds, hiding the error from benchmarking results but leaving it in logs.
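The recommended fix above is simply to delete the line. If a download step ever needed to survive for environments where models are not pre-staged, a guarded variant could skip local paths. The following is a sketch, not code from the PR; only the `${MODEL#*/}` rewrite and the example paths come from the comment above, and `maybe_download` is a hypothetical name:

```shell
#!/usr/bin/env bash
# Sketch only (not from the PR): reproduce the runner's MODEL rewrite and
# guard the download so a pre-staged local path never reaches `hf download`.

# The B300 runner's transformation: ${MODEL#*/} strips the shortest prefix
# through the first "/", i.e. the repo owner, then the pre-staged model
# directory is prefixed.
MODEL="zai-org/GLM-5-FP8"
MODEL="/scratch/models/${MODEL#*/}"   # -> /scratch/models/GLM-5-FP8

maybe_download() {
  local model=$1
  if [[ "$model" == /* ]]; then
    # Absolute path: assume the model is pre-staged, skip the download.
    echo "skipping download for local path: $model"
  else
    hf download "$model"
  fi
}

maybe_download "$MODEL"
```

Whether a guard like this is worth keeping is a judgment call; since the sibling B300 script omits the download entirely, deletion is the more consistent fix.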