Add B300 config: qwen3.5-bf16-sglang-mtp #1082
Merged · +128 −0
`benchmarks/single_node/qwen3.5_bf16_b300_mtp.sh` (new file, +98 lines):

```shell
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

hf download "$MODEL"

export NCCL_NVLS_ENABLE=1
export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
export PYTHONUNBUFFERED=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Default: recv every ~10 requests; if CONC ≥ 16, relax to ~30 requests between scheduler recv polls.
if [[ $CONC -ge 16 ]]; then
    SCHEDULER_RECV_INTERVAL=30
else
    SCHEDULER_RECV_INTERVAL=10
fi

MEM_FRAC_STATIC=0.82
CHUNKED_PREFILL_SIZE=32768
MAX_PREFILL_TOKENS=32768
CUDA_GRAPH_MAX_BATCH_SIZE=$CONC
MAX_RUNNING_REQUESTS=128
CONTEXT_LENGTH=$((ISL + OSL + 20))
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    CONTEXT_LENGTH="$EVAL_MAX_MODEL_LEN"
fi

echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
    --served-model-name "Qwen/Qwen3.5-397B-A17B" --trust-remote-code \
    --tensor-parallel-size=$TP --data-parallel-size=1 --ep-size $EP_SIZE \
    --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
    --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
    --context-length $CONTEXT_LENGTH --disable-radix-cache \
    --attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
    --enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
    --tokenizer-worker-num 6 --stream-interval 30 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
```
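The script derives two values from its inputs: the scheduler recv interval (relaxed once `CONC` reaches 16) and the context length (`ISL + OSL` plus a 20-token margin). The sketch below copies that logic into standalone functions so it can be checked in isolation; the function names are illustrative and not part of the repo.

```shell
#!/usr/bin/env bash
# Standalone sketch of the script's derived parameters.
# Logic copied from the config above; function names are mine.

recv_interval() {
  # Relax scheduler recv polling at higher concurrency (CONC >= 16).
  local conc=$1
  if [[ $conc -ge 16 ]]; then echo 30; else echo 10; fi
}

context_length() {
  # ISL + OSL plus a 20-token safety margin for the context window.
  local isl=$1 osl=$2
  echo $((isl + osl + 20))
}

echo "$(recv_interval 8) $(recv_interval 64) $(context_length 1024 1024)"
# prints: 10 30 2068
```

Note that `EVAL_ONLY=true` overrides the computed context length with `EVAL_MAX_MODEL_LEN`, so the margin only applies to throughput runs.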
🔴 The new `qwen3.5_bf16_b300_mtp.sh` script is missing the `SGLANG_ENABLE_SPEC_V2=1` environment variable that every other SGLang EAGLE MTP benchmark script in the repo includes, causing SGLang to fall back to the slower V1 speculative decoding path. This degrades MTP throughput and may produce inflated acceptance rates, making benchmark results non-comparable to the FP8 B300 MTP config. Fix by prepending `SGLANG_ENABLE_SPEC_V2=1` to the server launch line (line 55) alongside the existing `PYTHONNOUSERSITE=1` prefix.

Extended reasoning
What the bug is and how it manifests
The new `benchmarks/single_node/qwen3.5_bf16_b300_mtp.sh` script adds EAGLE speculative decoding flags (`--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`) to the SGLang server launch command at line 55, but omits the `SGLANG_ENABLE_SPEC_V2=1` environment variable. Without this flag, SGLang selects its legacy V1 speculative decoding code path even when EAGLE arguments are supplied.

The specific code path that triggers it
Line 55 of the new script begins `PYTHONNOUSERSITE=1 python3 -m sglang.launch_server ...`, with no `SGLANG_ENABLE_SPEC_V2=1` prefix.
The directly analogous script for the same hardware and same image (`qwen3.5_fp8_b300_mtp.sh`, PR #1035, line 34) does include the `SGLANG_ENABLE_SPEC_V2=1` prefix on its launch line. The same pattern appears in `dsr1_fp8_b200_mtp.sh` (line 57), `dsr1_fp8_b300_mtp.sh` (line 61), and `qwen3.5_fp8_h200_mtp.sh` (line 38). All four scripts using SGLang with EAGLE on recent images include the flag; the new BF16 B300 script is the sole outlier.

Why existing code does not prevent it
The EAGLE speculative decoding CLI flags are passed correctly; `SGLANG_ENABLE_SPEC_V2` is a separate runtime toggle that must be set in the environment before the Python process starts. SGLang does not error or warn when the flag is absent; it silently downgrades to the V1 path, so there is no automatic signal that the script is misconfigured.

What the impact would be
Benchmark runs will use SGLang's slower, less optimised V1 speculation path. PR #1017 was a dedicated follow-up fix titled "Enable SGLANG_ENABLE_SPEC_V2=1 for Qwen3.5 FP8 H200 SGLang MTP", demonstrating that the omission has a real, documented performance impact. Additionally, the V1 path can produce artificially high speculative acceptance rates, meaning the reported MTP numbers would not be comparable to the FP8 B300 MTP config that does use V2.
How to fix it

Prepend `SGLANG_ENABLE_SPEC_V2=1` to the server launch line, matching the pattern in `qwen3.5_fp8_b300_mtp.sh`. Alternatively, add `export SGLANG_ENABLE_SPEC_V2=1` alongside the other `export` statements at lines 23-26.

Step-by-step proof
- The script exports `NCCL_NVLS_ENABLE=1`, `SGL_ENABLE_JIT_DEEPGEMM=false`, `SGLANG_ENABLE_FLASHINFER_GEMM=true`, and `PYTHONUNBUFFERED=1` (lines 23-26), but not `SGLANG_ENABLE_SPEC_V2`.
- The launch line begins `PYTHONNOUSERSITE=1 python3 -m sglang.launch_server`, with no `SGLANG_ENABLE_SPEC_V2=1` prefix.
- The comparable MTP scripts (`dsr1_fp8_b200_mtp.sh`, `dsr1_fp8_b300_mtp.sh`, `qwen3.5_fp8_h200_mtp.sh`, `qwen3.5_fp8_b300_mtp.sh`) all set the flag, so their results are produced by the V2 path.
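Both forms of the fix rely on standard shell semantics: a `VAR=value` prefix on a command sets the variable only in that child's environment, while `export` makes it visible to every subsequent child. A minimal demonstration (the subshell here stands in for the `sglang.launch_server` process):

```shell
#!/usr/bin/env bash
# A VAR=value prefix sets the variable for the child process only;
# the parent shell's environment is untouched.

child=$(SGLANG_ENABLE_SPEC_V2=1 bash -c 'echo "${SGLANG_ENABLE_SPEC_V2:-unset}"')
parent=${SGLANG_ENABLE_SPEC_V2:-unset}
echo "child=$child parent=$parent"   # prints: child=1 parent=unset
```

Either placement therefore works, as long as the variable is in the environment before `python3 -m sglang.launch_server` starts; setting it after the launch line would have no effect on the already-running server.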