Add MI355X config: qwen3.5-bf16-sglang-mtp #1077
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work!

PR authors are responsible for ensuring that all GitHub Actions jobs fully pass after merging. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring the re-run passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to the core maintainers over Slack.
```shell
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
```
🔴 The new qwen3.5_bf16_mi355x_mtp.sh script is missing --use-chat-template from its run_benchmark_serving call (lines 62–70), while every other MTP benchmark script in the repository includes this flag. Without it, EAGLE speculative decoding acceptance rates are artificially inflated because random prompts are not formatted as chat messages, making benchmark results not comparable to other Qwen3.5 MTP configs. Add --use-chat-template to the run_benchmark_serving invocation to match the pattern established by qwen3.5_fp8_b200_mtp.sh, qwen3.5_fp8_h200_mtp.sh, qwen3.5_fp8_b300_mtp.sh, and all DSR1 MTP scripts.
Extended reasoning...
What the bug is and how it manifests
The benchmarks/single_node/qwen3.5_bf16_mi355x_mtp.sh script launches SGLang with EAGLE speculative decoding (--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4) but then calls run_benchmark_serving without the --use-chat-template flag. This means the benchmark tool generates raw random token prompts rather than prompts that have been formatted using the model's chat template.
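To make the distinction concrete, here is a minimal illustration (not code from the PR; the prompt text and variable names are invented) of wrapping a raw prompt in the ChatML-style markers that Qwen chat models are trained on. Passing `--use-chat-template` makes the benchmark client apply this kind of formatting instead of sending raw random-token prompts:

```shell
# Hypothetical sketch: wrap a raw prompt in ChatML-style markers
# (<|im_start|>/<|im_end|>). The prompt text is a placeholder; with
# --use-chat-template the benchmark client performs this formatting
# using the model's actual chat template.
raw_prompt="Summarize the plot of Hamlet in one sentence."

chat_prompt="<|im_start|>user
${raw_prompt}<|im_end|>
<|im_start|>assistant
"

printf '%s' "$chat_prompt"
```

Without this structure, the EAGLE draft model never sees the special tokens it was trained to condition on, which is what distorts the acceptance statistics.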
The specific code path that triggers it
In qwen3.5_bf16_mi355x_mtp.sh (lines 62–70), run_benchmark_serving is called with --model, --port, --backend vllm, --input-len, --output-len, --random-range-ratio, --num-prompts, --max-concurrency, --result-filename, and --result-dir — but --use-chat-template is absent. Every other MTP benchmark script in the repo passes this flag: qwen3.5_fp8_b200_mtp.sh (line 91), qwen3.5_fp8_h200_mtp.sh (line 82), qwen3.5_fp8_b300_mtp.sh (line 77), and all DSR1 MTP scripts (dsr1_fp8_b200_mtp.sh, dsr1_fp4_mi355x_atom_mtp.sh, dsr1_fp8_mi355x_atom_mtp.sh, dsr1_fp8_h200_trt_mtp.sh, dsr1_fp4_b200_trt_mtp.sh, dsr1_fp8_b300_mtp.sh). The non-MTP counterpart qwen3.5_bf16_mi355x.sh also lacks the flag, confirming this was inadvertently copied from the non-MTP script without adding the MTP-required flag.
Why existing code doesn't prevent it
There is no automated check enforcing --use-chat-template in MTP scripts. The script passes bash syntax validation (bash -n) without the flag. The error is a logical omission, not a syntax one.
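A lightweight guard would be easy to add. The sketch below (a hypothetical check, not part of the repository) flags any `*_mtp.sh` script that calls `run_benchmark_serving` without the flag:

```shell
# Hypothetical CI lint: every *_mtp.sh benchmark script that calls
# run_benchmark_serving must also pass --use-chat-template.
# Prints each offending file and returns nonzero if any are found.
check_mtp_scripts() {
  local dir="$1" bad=0
  for script in "$dir"/*_mtp.sh; do
    [ -e "$script" ] || continue  # no MTP scripts in this directory
    if grep -q 'run_benchmark_serving' "$script" \
       && ! grep -q -- '--use-chat-template' "$script"; then
      echo "missing --use-chat-template: $script"
      bad=1
    fi
  done
  return "$bad"
}
```

Wired into CI as, say, `check_mtp_scripts benchmarks/single_node`, this would have caught the omission at PR time rather than in benchmark results.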
What the impact would be
With EAGLE speculative decoding, the draft model proposes tokens conditioned on the training distribution, which includes the chat template's special tokens and formatting. When random raw-token prompts (without chat template structure) are used as inputs, the verifier accepts draft tokens at abnormally high rates because the output distribution is skewed. This inflates the reported MTP acceptance rate and throughput, producing numbers that overstate real-world performance gains from speculative decoding and are directly incomparable to the other Qwen3.5 MTP configs. This exact mechanism was explicitly documented as a bug fix in perf-changelog.yaml for PR #647 (dsr1-fp8-mi355x-sglang-disagg): 'Add --use-chat-template argument to benchmark_serving script. Without this arg, MTP acceptance rates are artificially high for DeepSeek with MTP'.
How to fix it
Add `--use-chat-template \` to the run_benchmark_serving call in qwen3.5_bf16_mi355x_mtp.sh, matching the pattern used in all other MTP benchmark scripts.
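Concretely, the corrected invocation would look roughly like the sketch below (arguments reproduced from the diff; the `$RESULT_DIR` variable name is an assumption, and only the final flag is new):

```shell
# Sketch only, not runnable standalone: run_benchmark_serving and the
# environment variables are defined by the benchmark script itself.
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir "$RESULT_DIR" \
    --use-chat-template
```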
Step-by-step proof
- The script starts an SGLang server with EAGLE speculative decoding (lines 40–56).
- `run_benchmark_serving` is called at lines 58–70 without `--use-chat-template`.
- Without the flag, the benchmark tool generates prompts sampled from random token IDs, not from the chat-formatted distribution the model was trained on.
- The EAGLE draft model has learned to predict tokens that follow chat-template patterns (e.g., `<|im_start|>assistant\n...`). When the prompt lacks these patterns, the draft model's proposals happen to coincide with what the verifier would generate at an inflated rate, not because speculative decoding is working well but because the input distribution is anomalous.
- The result: reported draft-token acceptance rates and throughput figures are higher than they would be with real user inputs, making the benchmark non-representative and not comparable to `qwen3.5_fp8_b200_mtp.sh`, `qwen3.5_fp8_h200_mtp.sh`, and `qwen3.5_fp8_b300_mtp.sh`, which all include `--use-chat-template`.
Mirrors the existing qwen3.5-bf16-mi355x-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) via the standard spec-decoding=mtp suffix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
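The speculative-decoding flags named in the commit message map onto the SGLang server launch roughly as follows (a sketch, not the script itself; the model path is a placeholder and the real script sets many more arguments):

```shell
# Sketch of the EAGLE MTP server launch. "$MODEL" is a placeholder;
# the actual script also configures port, parallelism, etc.
python3 -m sglang.launch_server \
    --model-path "$MODEL" \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
```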
Qwen3.5 MTP (EAGLE) benchmarks need the chat template applied so the client-side prompts match what the model was trained to predict; without it the spec-decoding quality regresses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from c39ab19 to 55b2975.
Summary
- `qwen3.5-bf16-mi355x-sglang-mtp` config mirroring the existing `qwen3.5-bf16-mi355x-sglang` non-MTP recipe, plus a new `benchmarks/single_node/qwen3.5_bf16_mi355x_mtp.sh` launch script.
- EAGLE speculative decoding enabled via `--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`.
- Config tagged `spec-decoding: mtp` so the MI355X runner picks up the `_mtp.sh` variant.
- `perf-changelog.yaml` entry added (PR link placeholder; update after merge per AGENTS.md).

Test plan
- `bash -n benchmarks/single_node/qwen3.5_bf16_mi355x_mtp.sh` — bash syntax OK.
- `python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/amd-master.yaml` — emits 14 entries (2 ISL/OSL × 7 concurrencies) with `spec-decoding=mtp`.

🤖 Generated with Claude Code