21 changes: 21 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -1870,6 +1870,27 @@ glm5-fp8-b200-sglang:
search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 256 }

# NOTE: At the time of submission, https://cookbook.sglang.io/autoregressive/GLM/GLM-5.1
# does not have a B300-specific recipe, so this config reuses the existing GLM5 FP8
# B200 SGLang recipe as-is until B300-specific tuning is available.
glm5-fp8-b300-sglang:
image: lmsysorg/sglang:v0.5.10.post1-cu130
model: zai-org/GLM-5-FP8
model-prefix: glm5
runner: b300
precision: fp8
framework: sglang
multinode: false
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 256 }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 256 }

glm5-fp4-b200-sglang:
image: lmsysorg/sglang:v0.5.10.post1-cu130
model: nvidia/GLM-5-NVFP4
89 changes: 89 additions & 0 deletions benchmarks/single_node/glm5_fp8_b300.sh
@@ -0,0 +1,89 @@
#!/usr/bin/env bash

# NOTE: At the time of submission, https://cookbook.sglang.io/autoregressive/GLM/GLM-5.1
# does not have a B300-specific recipe, so this script reuses the existing
# GLM5 FP8 B200 SGLang recipe as-is until B300-specific tuning is available.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

🔴 The new script calls hf download "$MODEL" (line 24), but on B300 the runner overrides MODEL to a local filesystem path (/scratch/models/GLM-5-FP8), which is not a valid HuggingFace repo ID, so hf download fails. Remove line 24; models are pre-staged on B300, as confirmed by qwen3.5_fp8_b300_mtp.sh, which correctly omits this call.

Extended reasoning...

What the bug is and how it manifests

The new benchmark script benchmarks/single_node/glm5_fp8_b300.sh (line 24) calls hf download "$MODEL". The hf CLI's download subcommand expects a HuggingFace repository identifier in owner/repo format (e.g. zai-org/GLM-5-FP8). Passing a local filesystem path instead causes the command to exit with an error.

The specific code path that triggers it

In runners/launch_b300-nv.sh (single-node branch, line 220), the runner transforms the model identifier before invoking the benchmark script:

export MODEL="/scratch/models/${MODEL#*/}"

So the original config value zai-org/GLM-5-FP8 becomes /scratch/models/GLM-5-FP8. The benchmark script then executes hf download "/scratch/models/GLM-5-FP8", which is not a valid repo ID.
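For illustration, a minimal shell reproduction of that expansion (values taken from the config; this snippet is not part of either script):

MODEL="zai-org/GLM-5-FP8"
MODEL="/scratch/models/${MODEL#*/}"   # ${MODEL#*/} strips the shortest prefix ending in "/", leaving GLM-5-FP8
echo "$MODEL"                         # /scratch/models/GLM-5-FP8
hf download "$MODEL"                  # fails: hf download expects an owner/repo ID, not a filesystem path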

Why existing code doesn't prevent it

There is no set -e before line 24, so the script continues execution after hf download fails. The SGLang server is then started with --model-path=$MODEL, which correctly points to the pre-staged local path — so the benchmark itself still runs. This masks the bug during casual observation but leaves a broken command and spurious error output in every run's logs.
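A minimal illustration of that behavior, with false standing in for the failing hf download call (hypothetical commands, only to show the difference):

bash -c 'false; echo "still running"'          # without set -e: prints "still running" despite the failure
bash -c 'set -e; false; echo "still running"'  # with set -e: exits at the failing command, echo never runs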

What the impact would be

Every B300 run of this config will produce an error from hf download in the logs. If the B300 environment ever changes so that /scratch/models/ is not pre-populated (e.g. a new node or a CI dry-run), the benchmark would fail to start because the model would be absent and the server launch would fail. The spurious error also makes log triage harder for operators.

How to fix it

Remove line 24 (hf download "$MODEL") from benchmarks/single_node/glm5_fp8_b300.sh. Models are pre-staged at /scratch/models/ on B300, so no download step is needed. This matches the pattern of the existing B300 SGLang single-node script benchmarks/single_node/qwen3.5_fp8_b300_mtp.sh, which has no hf download call.
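If a download step ever needs to be kept, a guarded variant would be one possible alternative (a sketch only; deleting the line, as recommended above, is the simpler fix):

if [[ "$MODEL" != /* ]]; then
  # Only attempt a download when MODEL looks like an org/repo ID rather than a pre-staged local path.
  hf download "$MODEL"
fi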

Step-by-step proof

  1. nvidia-master.yaml config specifies model: zai-org/GLM-5-FP8 and runner: b300.
  2. launch_b300-nv.sh single-node branch (line 220) executes export MODEL="/scratch/models/${MODEL#*/}", so MODEL becomes /scratch/models/GLM-5-FP8.
  3. The runner then calls the benchmark script with this modified MODEL.
  4. glm5_fp8_b300.sh line 24 executes: hf download "/scratch/models/GLM-5-FP8".
  5. hf download fails because /scratch/models/GLM-5-FP8 is not an org/repo identifier.
  6. Since there is no set -e at this point, execution continues to the SGLang server launch, which uses --model-path=$MODEL (the pre-staged path) and succeeds — hiding the error from benchmarking results but leaving it in logs.


pip install --no-deps "transformers==5.2.0" "huggingface-hub==1.4.1"

export SGL_ENABLE_JIT_DEEPGEMM=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}


echo "CONC: $CONC, ISL: $ISL, OSL: $OSL"

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
--trust-remote-code \
--tensor-parallel-size=$TP \
--data-parallel-size 1 --expert-parallel-size 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--kv-cache-dtype fp8_e4m3 --quantization fp8 \
--attention-backend nsa \
--nsa-decode-backend trtllm --nsa-prefill-backend trtllm \
--moe-runner-backend flashinfer_trtllm \
--cuda-graph-max-bs $CONC --max-running-requests $CONC \
--mem-fraction-static 0.85 \
--chunked-prefill-size 32768 --max-prefill-tokens 32768 \
--enable-flashinfer-allreduce-fusion --disable-radix-cache \
--stream-interval 30 \
--model-loader-extra-config '{"enable_multithread_load": true}' $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT"
append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
8 changes: 8 additions & 0 deletions perf-changelog.yaml
@@ -1412,3 +1412,11 @@
- "Image: lmsysorg/sglang:v0.5.10.post1-cu130"
- "At the time of submission, https://cookbook.sglang.io/autoregressive/DeepSeek/DeepSeek-R1 does not have a B300-specific recipe, so this reuses the existing DSR1 FP8 B200 SGLang recipe as-is"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1050

- config-keys:
- glm5-fp8-b300-sglang
description:
- "Add GLM-5 FP8 B300 SGLang benchmark"
- "Image: lmsysorg/sglang:v0.5.10.post1-cu130"
- "At the time of submission, https://cookbook.sglang.io/autoregressive/GLM/GLM-5.1 does not have a B300-specific recipe, so this reuses the existing GLM5 FP8 B200 SGLang recipe as-is"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1051

🟡 The glm5-fp8-b300-sglang entry in perf-changelog.yaml has a placeholder PR link (pull/XXXX) instead of the actual PR number 1051. It should be updated to #1051.

Extended reasoning...

The perf-changelog.yaml entry for glm5-fp8-b300-sglang was committed with an unresolved placeholder in its pr-link field. The current HEAD of the file contains "pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX" at line 1398, while the PR diff shows the intended value was pull/1051.

The specific code path is straightforward: commit d6e32c3 (the PR #1051 merge commit) introduced the glm5-fp8-b300-sglang entry to perf-changelog.yaml, but the author never replaced the XXXX placeholder before merging. The diff clearly shows pull/1051 as the intended value, yet the committed content still has XXXX.

Nothing in the codebase prevents placeholder values from being committed — there is no pre-commit validation or CI check that would catch an XXXX in a pr-link field. This explains how it slipped through. Other entries in the same file with XXX or XXXX placeholders confirm this is a recurring human error (e.g. glm5-fp8-mi355x-sglang, minimaxm2.5-fp8-h200-vllm).
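A hypothetical guard (not present in the repo) that would catch this class of mistake could be as simple as a grep in CI or a pre-commit hook:

if grep -nE 'pr-link:.*pull/X+' perf-changelog.yaml; then
  echo "ERROR: perf-changelog.yaml contains an unresolved pr-link placeholder" >&2
  exit 1
fi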

The impact is limited to documentation/metadata: anyone reading the changelog or trying to trace the history of this benchmark config would find a broken link. The placeholder XXXX does not affect benchmark execution, configuration parsing, or any runtime behavior.

The fix is a one-line change: replace pull/XXXX with pull/1051 on line 1398 of perf-changelog.yaml.
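As a shell one-liner (a sketch; scoped to line 1398 so that placeholders in other entries are left untouched, and assuming GNU sed's -i):

sed -i '1398s|pull/XXXX|pull/1051|' perf-changelog.yaml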

Step-by-step proof

  1. The PR diff shows the new entry ending with "pr-link: #1051".
  2. Reading the actual file at HEAD shows the last line is "pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX".
  3. Running git show d6e32c3 -- perf-changelog.yaml confirms the committed content has XXXX.
  4. The immediately preceding entry (qwen3.5-fp8-b300-sglang-mtp, PR #1035) correctly references its PR number, confirming the XXXX in the glm5 entry is an oversight, not intentional.
