Add B300 config: qwen3.5-fp4-sglang-mtp #1083
Merged
New file `benchmarks/single_node/qwen3.5_fp4_b300_mtp.sh` (+113 lines):

```bash
#!/usr/bin/env bash

# Follows the SGLang cookbook recipe at
# https://cookbook.sglang.io/autoregressive/Qwen/Qwen3.5 as of 2026-04-17.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Default: recv every ~10 requests; if CONC >= 16, relax to ~30 requests between scheduler recv polls.
if [[ $CONC -ge 16 ]]; then
    SCHEDULER_RECV_INTERVAL=30
else
    SCHEDULER_RECV_INTERVAL=10
fi

MEM_FRAC_STATIC=0.8
CHUNKED_PREFILL_SIZE=32768
MAX_PREFILL_TOKENS=32768
CUDA_GRAPH_MAX_BATCH_SIZE=$CONC
MAX_RUNNING_REQUESTS=128
CONTEXT_LENGTH=$((ISL + OSL + 20))
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    CONTEXT_LENGTH="$EVAL_MAX_MODEL_LEN"
fi

if [[ $TP -eq 8 ]]; then
    EXTRA_ARGS="--enable-flashinfer-allreduce-fusion"
else
    EXTRA_ARGS=""
fi

echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
    --trust-remote-code \
    --tensor-parallel-size=$TP --data-parallel-size=1 --ep-size $EP_SIZE \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --mamba-scheduler-strategy no_buffer \
    --quantization modelopt_fp4 --fp4-gemm-backend flashinfer_cutlass \
    --kv-cache-dtype fp8_e4m3 \
    --mamba-ssm-dtype bfloat16 \
    --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
    --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
    --context-length $CONTEXT_LENGTH --disable-radix-cache \
    --attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
    $EXTRA_ARGS --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
    --tokenizer-worker-num 6 --stream-interval 30 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
```
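The script leans on helpers from `benchmark_lib.sh` (`check_env_vars`, `wait_for_server_ready`, `start_gpu_monitor`, ...) whose definitions are not part of this diff. As a rough illustration only, here is a minimal sketch of what a `check_env_vars`-style guard could look like; the real helper may differ:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a check_env_vars helper: fail fast with a clear
# message if any required environment variable is unset or empty.
check_env_vars() {
    local var missing=0
    for var in "$@"; do
        # ${!var} is bash indirect expansion: the value of the variable
        # whose name is stored in $var.
        if [[ -z "${!var}" ]]; then
            echo "ERROR: required env var $var is unset or empty" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: MODEL is set, TP is not, so the check fails and names TP.
MODEL="Qwen/Qwen3.5"
unset TP
if ! check_env_vars MODEL TP; then
    echo "aborting: missing vars"
fi
```

Failing before the server launch keeps a misconfigured SLURM job from burning GPU time on a half-configured benchmark.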
🔴 The new `qwen3.5_fp4_b300_mtp.sh` script is missing `SGLANG_ENABLE_SPEC_V2=1` before the `python3 -m sglang.launch_server` invocation. Without this flag, EAGLE speculative decoding falls back to the older spec v1 code path, producing inaccurate or suboptimal benchmark results. Add `SGLANG_ENABLE_SPEC_V2=1` as an inline env var prefix before `PYTHONNOUSERSITE=1 python3` on line 62.

**What the bug is and how it manifests**

The new `benchmarks/single_node/qwen3.5_fp4_b300_mtp.sh` launches the SGLang server at line 62 with `PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...`. It omits the `SGLANG_ENABLE_SPEC_V2=1` env-var prefix that every other MTP/EAGLE script in the repository includes. Without this flag, SGLang selects the older speculative-decoding v1 code path even though the EAGLE algorithm requires the v2 path.

**The specific code path that triggers it**

Every other MTP benchmark script sets the flag inline before the `python3` invocation:

- `qwen3.5_fp8_b300_mtp.sh` line 34: `SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...`
- `qwen3.5_fp8_h200_mtp.sh` line 38: `SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server ...`
- `dsr1_fp8_b200_mtp.sh` line 57: `SGLANG_ENABLE_SPEC_V2=1 ...`
- `dsr1_fp8_b300_mtp.sh` line 61: `SGLANG_ENABLE_SPEC_V2=1 ...`

This PR's script is the only MTP launch script in the repo that omits it.

**Why existing code doesn't prevent it**

There is no global export of `SGLANG_ENABLE_SPEC_V2` in `benchmark_lib.sh` or the container entrypoint; each script is responsible for setting it inline. The bash syntax check (`bash -n`) listed in the test plan confirms only syntax validity, not correctness of env vars. The omission silently degrades behavior at runtime.

**What the impact would be**

SGLang v0.5.10.post1-cu130 requires `SGLANG_ENABLE_SPEC_V2=1` for EAGLE speculative decoding to use the optimised v2 scheduler. Without it, the server runs the v1 speculative path, which yields lower acceptance rates and reduced throughput, so all benchmark numbers (tokens/s, TTFT, ITL) collected under this config will be unrepresentative of the intended MTP configuration. The perf-changelog entry for PR #1017 explicitly documents this requirement: "Enable SGLANG_ENABLE_SPEC_V2=1 for Qwen3.5 FP8 H200 SGLang MTP" because EAGLE requires spec v2.

**How to fix it**

Prepend `SGLANG_ENABLE_SPEC_V2=1` to the server launch line, matching the pattern of all other MTP scripts: `SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL ...`

**Step-by-step proof**

1. The config file (`.github/configs/nvidia-master.yaml`) marks all search-space entries with `spec-decoding: mtp`, meaning the runner selects this `_mtp.sh` variant specifically to exercise EAGLE speculative decoding.
2. The script passes `--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4` to the server, confirming EAGLE is intended.
3. Without `SGLANG_ENABLE_SPEC_V2=1`, SGLang's internal feature flag for the v2 speculative scheduler remains `false`.
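The repo-wide consistency claim above is mechanically checkable with `grep -L`, which prints files containing no line that matches the pattern. A small self-contained sketch, using throwaway stand-in files rather than the real repo layout:

```shell
#!/usr/bin/env bash
# Sketch: flag MTP launch scripts that never set SGLANG_ENABLE_SPEC_V2=1.
# Demonstrated on two throwaway files; against the real repo you would
# point the glob at benchmarks/single_node/*_mtp.sh instead.
tmp=$(mktemp -d)
printf 'SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server\n' \
    > "$tmp/qwen_ok_mtp.sh"
printf 'PYTHONNOUSERSITE=1 python3 -m sglang.launch_server\n' \
    > "$tmp/qwen_missing_mtp.sh"

# grep -L lists files with NO matching line, i.e. the offenders.
grep -L 'SGLANG_ENABLE_SPEC_V2=1' "$tmp"/*_mtp.sh

rm -rf "$tmp"
```

Running such a check in CI (failing when `grep -L` produces any output) would catch this class of omission before it silently skews benchmark numbers.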