Merged
20 changes: 20 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -2047,6 +2047,26 @@ qwen3.5-bf16-b300-sglang:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 64 }
- { tp: 4, ep: 1, conc-start: 4, conc-end: 64 }

qwen3.5-bf16-b300-sglang-mtp:
image: lmsysorg/sglang:v0.5.10.post1-cu130
model: Qwen/Qwen3.5-397B-A17B
model-prefix: qwen3.5
runner: b300
precision: bf16
framework: sglang
multinode: false
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
- { tp: 4, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
- { tp: 4, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }

kimik2.5-int4-b200-vllm:
image: vllm/vllm-openai:v0.15.1
model: moonshotai/Kimi-K2.5
98 changes: 98 additions & 0 deletions benchmarks/single_node/qwen3.5_bf16_b300_mtp.sh
@@ -0,0 +1,98 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1
Comment on lines +23 to +26
🔴 The new qwen3.5_bf16_b300_mtp.sh script is missing the SGLANG_ENABLE_SPEC_V2=1 environment variable that every other SGLang EAGLE MTP benchmark script in the repo includes, causing SGLang to fall back to the slower V1 speculative decoding path. This degrades MTP throughput and may produce inflated acceptance rates, making benchmark results non-comparable to the FP8 B300 MTP config. Fix by prepending SGLANG_ENABLE_SPEC_V2=1 to the server launch line (line 55) alongside the existing PYTHONNOUSERSITE=1 prefix.

Extended reasoning...

What the bug is and how it manifests

The new benchmarks/single_node/qwen3.5_bf16_b300_mtp.sh script adds EAGLE speculative decoding flags (--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4) to the SGLang server launch command at line 55, but omits the SGLANG_ENABLE_SPEC_V2=1 environment variable. Without this flag, SGLang selects its legacy V1 speculative decoding code path even when EAGLE arguments are supplied.

The specific code path that triggers it

Line 55 of the new script reads:

PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...

The directly analogous script for the same hardware and same image (qwen3.5_fp8_b300_mtp.sh, PR #1035, line 34) reads:

SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...

The same pattern appears in dsr1_fp8_b200_mtp.sh (line 57), dsr1_fp8_b300_mtp.sh (line 61), and qwen3.5_fp8_h200_mtp.sh (line 38). All four scripts using SGLang with EAGLE on recent images include the flag; the new BF16 B300 script is the sole outlier.

Why existing code does not prevent it

The EAGLE speculative decoding CLI flags are passed correctly — SGLANG_ENABLE_SPEC_V2 is a separate runtime toggle that must be exported as an env var before the Python process starts. SGLang does not error or warn when the flag is absent; it silently downgrades to the V1 path, so there is no automatic signal that the script is misconfigured.

What the impact would be

Benchmark runs will use SGLang's slower, less optimised V1 speculation path. PR #1017 was a dedicated follow-up fix titled 'Enable SGLANG_ENABLE_SPEC_V2=1 for Qwen3.5 FP8 H200 SGLang MTP', demonstrating that the omission has a real, documented performance impact. Additionally, the V1 path can produce artificially high speculative acceptance rates, meaning the reported MTP numbers would not be comparable to the FP8 B300 MTP config that does use V2.

How to fix it

Prepend SGLANG_ENABLE_SPEC_V2=1 to the server launch line, matching the pattern in qwen3.5_fp8_b300_mtp.sh:

SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...

Alternatively, add export SGLANG_ENABLE_SPEC_V2=1 alongside the other export statements at lines 23-26.

Step-by-step proof

  1. The script exports NCCL_NVLS_ENABLE=1, SGL_ENABLE_JIT_DEEPGEMM=false, SGLANG_ENABLE_FLASHINFER_GEMM=true, and PYTHONUNBUFFERED=1 (lines 23-26), but not SGLANG_ENABLE_SPEC_V2.
  2. At line 55 the server is launched with PYTHONNOUSERSITE=1 python3 -m sglang.launch_server — no SGLANG_ENABLE_SPEC_V2=1 prefix.
  3. SGLang checks this env var at startup to decide which speculation engine to use; since it is unset (defaults to 0/false), it activates the V1 path.
  4. The four comparable MTP scripts (dsr1_fp8_b200_mtp.sh, dsr1_fp8_b300_mtp.sh, qwen3.5_fp8_h200_mtp.sh, qwen3.5_fp8_b300_mtp.sh) all set the flag, so their results are produced by the V2 path.
  5. Any throughput or acceptance-rate comparison between the new BF16 B300 MTP config and existing MTP configs will therefore compare V1 results against V2 results — an apples-to-oranges comparison that corrupts benchmark conclusions.


SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Default: recv every ~10 requests; if CONC ≥ 16, relax to ~30 requests between scheduler recv polls.
if [[ "$CONC" -ge 16 ]]; then
    SCHEDULER_RECV_INTERVAL=30
else
    SCHEDULER_RECV_INTERVAL=10
fi

MEM_FRAC_STATIC=0.82
CHUNKED_PREFILL_SIZE=32768
MAX_PREFILL_TOKENS=32768
CUDA_GRAPH_MAX_BATCH_SIZE=$CONC
MAX_RUNNING_REQUESTS=128
CONTEXT_LENGTH=$((ISL + OSL + 20))
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    CONTEXT_LENGTH="$EVAL_MAX_MODEL_LEN"
fi

echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
    --served-model-name "Qwen/Qwen3.5-397B-A17B" --trust-remote-code \
    --tensor-parallel-size=$TP --data-parallel-size=1 --ep-size $EP_SIZE \
    --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
    --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
    --context-length $CONTEXT_LENGTH --disable-radix-cache \
    --attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
    --enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
    --tokenizer-worker-num 6 --stream-interval 30 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
10 changes: 10 additions & 0 deletions perf-changelog.yaml
@@ -1506,3 +1506,13 @@
- "Mirrors the qwen3.5-bf16-b200-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4)"
- "Configs: 1k1k and 8k1k, TP=8/EP=1 conc 4-64 with spec-decoding=mtp"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX

- config-keys:
- qwen3.5-bf16-b300-sglang-mtp
description:
- "Add Qwen3.5-397B-A17B BF16 B300 SGLang MTP benchmark"
- "Image: lmsysorg/sglang:v0.5.10.post1-cu130"
- "Model: Qwen/Qwen3.5-397B-A17B"
- "Mirrors the qwen3.5-bf16-b300-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4)"
- "Configs: 1k1k and 8k1k, TP8/EP1 conc 4-64 + TP4/EP1 conc 4-64, spec-decoding=mtp"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX