Merged
20 changes: 20 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -2027,6 +2027,26 @@ qwen3.5-fp4-b300-sglang:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 128 }
- { tp: 2, ep: 2, conc-start: 4, conc-end: 128 }

qwen3.5-fp4-b300-sglang-mtp:
image: lmsysorg/sglang:v0.5.10.post1-cu130
model: nvidia/Qwen3.5-397B-A17B-NVFP4
model-prefix: qwen3.5
runner: b300
precision: fp4
framework: sglang
multinode: false
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 128, spec-decoding: mtp }
- { tp: 2, ep: 2, conc-start: 4, conc-end: 128, spec-decoding: mtp }
- isl: 8192
osl: 1024
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 128, spec-decoding: mtp }
- { tp: 2, ep: 2, conc-start: 4, conc-end: 128, spec-decoding: mtp }
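Each search-space entry above describes a concurrency sweep from `conc-start` to `conc-end` for a fixed TP/EP layout. The runner's actual stepping policy is not shown in this diff; as a hedged sketch, assuming the common power-of-two sweep, one entry would expand into concrete benchmark points like this (`expand` is a hypothetical helper, not part of the repo):

```python
def expand(entry):
    # Walk concurrency from conc-start to conc-end, doubling each step.
    # The doubling step is an assumption; the real runner may differ.
    conc = entry["conc-start"]
    points = []
    while conc <= entry["conc-end"]:
        points.append({"tp": entry["tp"], "ep": entry["ep"], "conc": conc})
        conc *= 2
    return points

entry = {"tp": 4, "ep": 1, "conc-start": 4, "conc-end": 128}
print([p["conc"] for p in expand(entry)])  # → [4, 8, 16, 32, 64, 128]
```

Under that assumption, the two entries per sequence-length config yield twelve runs each (six concurrency points times two parallelism layouts).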

qwen3.5-bf16-b300-sglang:
image: lmsysorg/sglang:v0.5.10.post1-cu130
model: Qwen/Qwen3.5-397B-A17B
113 changes: 113 additions & 0 deletions benchmarks/single_node/qwen3.5_fp4_b300_mtp.sh
@@ -0,0 +1,113 @@
#!/usr/bin/env bash

# Follows the SGLang cookbook recipe at
# https://cookbook.sglang.io/autoregressive/Qwen/Qwen3.5 as of 2026-04-17.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME \
EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Default: recv every ~10 requests; if CONC >= 16, relax to ~30 requests between scheduler recv polls.
if [[ $CONC -ge 16 ]]; then
SCHEDULER_RECV_INTERVAL=30
else
SCHEDULER_RECV_INTERVAL=10
fi

MEM_FRAC_STATIC=0.8
CHUNKED_PREFILL_SIZE=32768
MAX_PREFILL_TOKENS=32768
CUDA_GRAPH_MAX_BATCH_SIZE=$CONC
MAX_RUNNING_REQUESTS=128
CONTEXT_LENGTH=$((ISL + OSL + 20))
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
CONTEXT_LENGTH="$EVAL_MAX_MODEL_LEN"
fi

if [[ $TP -eq 8 ]]; then
EXTRA_ARGS="--enable-flashinfer-allreduce-fusion"
else
EXTRA_ARGS=""
fi

echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
--trust-remote-code \
--tensor-parallel-size=$TP --data-parallel-size=1 --ep-size $EP_SIZE \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--mamba-scheduler-strategy no_buffer \
--quantization modelopt_fp4 --fp4-gemm-backend flashinfer_cutlass \
--kv-cache-dtype fp8_e4m3 \
--mamba-ssm-dtype bfloat16 \
--cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
--mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
Comment on lines +62 to +74 (Contributor):
🔴 The new qwen3.5_fp4_b300_mtp.sh script is missing SGLANG_ENABLE_SPEC_V2=1 before the python3 -m sglang.launch_server invocation. Without this flag, EAGLE speculative decoding will fall back to the older spec v1 code path, producing inaccurate or suboptimal benchmark results — add SGLANG_ENABLE_SPEC_V2=1 as an inline env var prefix before PYTHONNOUSERSITE=1 python3 on line 62.

Extended reasoning:

What the bug is and how it manifests

The new benchmarks/single_node/qwen3.5_fp4_b300_mtp.sh launches the SGLang server at line 62 with:

PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...

It omits the SGLANG_ENABLE_SPEC_V2=1 env-var prefix that every other MTP/EAGLE script in the repository includes. Without this flag, SGLang selects the older speculative-decoding v1 code path even though the EAGLE algorithm requires the v2 path.

The specific code path that triggers it

Every other MTP benchmark script sets the flag inline before the python3 invocation:

  • qwen3.5_fp8_b300_mtp.sh line 34: SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...
  • qwen3.5_fp8_h200_mtp.sh line 38: SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server ...
  • dsr1_fp8_b200_mtp.sh line 57: SGLANG_ENABLE_SPEC_V2=1 ...
  • dsr1_fp8_b300_mtp.sh line 61: SGLANG_ENABLE_SPEC_V2=1 ...

This PR's script is the only MTP launch script in the repo that omits it.

Why existing code doesn't prevent it

There is no global export of SGLANG_ENABLE_SPEC_V2 in benchmark_lib.sh or the container entrypoint; each script is responsible for setting it inline. The bash syntax check (bash -n) listed in the test plan confirms only syntax validity, not correctness of env vars. The omission silently degrades behavior at runtime.

What the impact would be

SGLang v0.5.10.post1-cu130 requires SGLANG_ENABLE_SPEC_V2=1 for EAGLE speculative decoding to use the optimised v2 scheduler. Without it, the server runs the v1 speculative path, which yields lower acceptance rates and reduced throughput — meaning all benchmark numbers (tokens/s, TTFT, ITL) collected under this config will be unrepresentative of the intended MTP configuration. The perf-changelog entry for PR #1017 explicitly documents this requirement: "Enable SGLANG_ENABLE_SPEC_V2=1 for Qwen3.5 FP8 H200 SGLang MTP" because EAGLE requires spec v2.

How to fix it

Prepend SGLANG_ENABLE_SPEC_V2=1 to the server launch line, matching the pattern of all other MTP scripts:

SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL ...

Step-by-step proof

  1. The YAML config (.github/configs/nvidia-master.yaml) marks all search-space entries with spec-decoding: mtp, meaning the runner selects this _mtp.sh variant specifically to exercise EAGLE speculative decoding.
  2. The script passes --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 to the server — confirming EAGLE is intended.
  3. However, without SGLANG_ENABLE_SPEC_V2=1, SGLang's internal feature flag for the v2 speculative scheduler remains false.
  4. SGLang falls back to the v1 path: the EAGLE draft model still runs, but the v1 scheduler does not handle EAGLE's multi-token acceptance correctly, leading to degraded throughput and inaccurate acceptance-rate telemetry.
  5. Any benchmark result filed under this config will therefore underrepresent true MTP performance — the exact issue that PR #1017 ("[NV] Update: sglang v2 Qwen3.5 h200 MTP") was created to fix for the FP8 H200 MTP script.

--context-length $CONTEXT_LENGTH --disable-radix-cache \
--attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
$EXTRA_ARGS --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
--tokenizer-worker-num 6 --stream-interval 30 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
> $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/ \
--use-chat-template

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT"
append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
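The script leans on `check_env_vars` from `benchmark_lib.sh` to fail fast when a required variable is missing. That helper is not shown in this diff; the sketch below is a minimal, assumed reimplementation of its behavior (report every unset or empty variable, return non-zero if any is missing) — the real helper may differ:

```shell
# Assumed behavior of check_env_vars (the real helper lives in
# benchmark_lib.sh and is not part of this PR's diff).
check_env_vars() {
  local status=0
  local var
  for var in "$@"; do
    # ${!var:-} is bash indirect expansion: the value of the variable
    # whose name is stored in $var, or empty if unset.
    if [[ -z "${!var:-}" ]]; then
      echo "ERROR: required env var $var is not set" >&2
      status=1
    fi
  done
  return $status
}

# Example: TP is set, CONC is not.
TP=4
unset CONC
check_env_vars TP CONC || echo "missing vars detected"
```

Failing fast here matters because the server launch otherwise dies much later with an opaque error once an empty `$TP` or `$EP_SIZE` reaches `sglang.launch_server`.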
10 changes: 10 additions & 0 deletions perf-changelog.yaml
@@ -1516,3 +1516,13 @@
- "Mirrors the qwen3.5-bf16-b300-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4)"
- "Configs: 1k1k and 8k1k, TP8/EP1 conc 4-64 + TP4/EP1 conc 4-64, spec-decoding=mtp"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX

- config-keys:
- qwen3.5-fp4-b300-sglang-mtp
description:
- "Add Qwen3.5-397B-A17B NVFP4 B300 SGLang MTP benchmark"
- "Image: lmsysorg/sglang:v0.5.10.post1-cu130"
- "Model: nvidia/Qwen3.5-397B-A17B-NVFP4"
- "Mirrors the qwen3.5-fp4-b300-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4)"
- "Configs: 1k1k and 8k1k, TP4/EP1 conc 4-128 + TP2/EP2 conc 4-128, spec-decoding=mtp"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX
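Each `perf-changelog.yaml` entry follows the same shape: `config-keys`, `description`, and `pr-link`. The repo's actual CI validation (if any) is not shown in this diff; as a hedged illustration, a minimal hypothetical check of that shape could look like this:

```python
def validate_entry(entry):
    # Field names taken from the changelog entries above; whether the
    # repo enforces them like this is an assumption.
    required = {"config-keys", "description", "pr-link"}
    missing = required - entry.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not entry["config-keys"]:
        raise ValueError("config-keys must be non-empty")
    return True

entry = {
    "config-keys": ["qwen3.5-fp4-b300-sglang-mtp"],
    "description": ["Add Qwen3.5-397B-A17B NVFP4 B300 SGLang MTP benchmark"],
    "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX",
}
print(validate_entry(entry))  # → True
```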