
Add MI355X config: qwen3.5-bf16-sglang-mtp#1077

Merged
functionstackx merged 2 commits into main from claude/add-qwen3.5-bf16-mi355x-mtp
Apr 18, 2026

Conversation

@functionstackx
Contributor

Summary

  • Adds qwen3.5-bf16-mi355x-sglang-mtp config mirroring the existing qwen3.5-bf16-mi355x-sglang non-MTP recipe, plus a new benchmarks/single_node/qwen3.5_bf16_mi355x_mtp.sh launch script.
  • Adds EAGLE speculative decoding flags on top of the non-MTP script: --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.
  • Search space rows carry spec-decoding: mtp so the MI355X runner picks up the _mtp.sh variant.
  • Adds a perf-changelog.yaml entry (PR link placeholder; update after merge per AGENTS.md).
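The EAGLE flags listed above can be sketched as a layer on top of the server launch. A minimal sketch, not taken from the repo's script: the four `--speculative-*` flag values come from the PR summary, while the `sglang.launch_server` entry point, `--model-path`, `MODEL`, and `PORT` are placeholder assumptions.

```shell
# Sketch only: how the MTP variant layers EAGLE flags onto the server launch.
# MODEL and PORT are placeholders; the launch command is an assumption.
MODEL="Qwen/Qwen3.5"   # placeholder model id
PORT=30000             # placeholder port
EAGLE_FLAGS=(
  --speculative-algorithm EAGLE
  --speculative-num-steps 3
  --speculative-eagle-topk 1
  --speculative-num-draft-tokens 4
)
# Join the flag array into the final command line and show it.
cmd="python3 -m sglang.launch_server --model-path $MODEL --port $PORT ${EAGLE_FLAGS[*]}"
echo "$cmd"
```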

Test plan

  • YAML parses for both master config and perf-changelog.
  • bash -n benchmarks/single_node/qwen3.5_bf16_mi355x_mtp.sh — bash syntax OK.
  • python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/amd-master.yaml — emits 14 entries (2 ISL/OSL × 7 concurrencies) with spec-decoding=mtp.
  • CI sweep passes on MI355X.
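The 14-entry count in the sweep check is just the cross product of the two search-space axes. A quick sketch with hypothetical axis values (the actual ISL/OSL pairs and concurrency list live in the master config and are not quoted here):

```shell
# Sketch: 2 ISL/OSL pairs x 7 concurrencies = 14 sweep entries.
# The concrete pair and concurrency values below are hypothetical.
ISL_OSL_PAIRS=("1024:1024" "8192:1024")
CONCURRENCIES=(1 2 4 8 16 32 64)
total=$(( ${#ISL_OSL_PAIRS[@]} * ${#CONCURRENCIES[@]} ))
echo "$total"   # prints 14
```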

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe matches the official vLLM recipes and/or the SGLang cookbook.

If it does not, please create a PR against those upstream recipes first before we can merge your PR into the master branch. Let's ensure the documentation is first class so the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Often, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines +62 to +70
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
Contributor

🔴 The new qwen3.5_bf16_mi355x_mtp.sh script is missing --use-chat-template from its run_benchmark_serving call (lines 62–70), while every other MTP benchmark script in the repository includes this flag. Without it, EAGLE speculative decoding acceptance rates are artificially inflated because random prompts are not formatted as chat messages, making benchmark results not comparable to other Qwen3.5 MTP configs. Add --use-chat-template to the run_benchmark_serving invocation to match the pattern established by qwen3.5_fp8_b200_mtp.sh, qwen3.5_fp8_h200_mtp.sh, qwen3.5_fp8_b300_mtp.sh, and all DSR1 MTP scripts.

Extended reasoning...

What the bug is and how it manifests

The benchmarks/single_node/qwen3.5_bf16_mi355x_mtp.sh script launches SGLang with EAGLE speculative decoding (--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4) but then calls run_benchmark_serving without the --use-chat-template flag. This means the benchmark tool generates raw random token prompts rather than prompts that have been formatted using the model's chat template.

The specific code path that triggers it

In qwen3.5_bf16_mi355x_mtp.sh (lines 62–70), run_benchmark_serving is called with --model, --port, --backend vllm, --input-len, --output-len, --random-range-ratio, --num-prompts, --max-concurrency, --result-filename, and --result-dir — but --use-chat-template is absent. Every other MTP benchmark script in the repo passes this flag: qwen3.5_fp8_b200_mtp.sh (line 91), qwen3.5_fp8_h200_mtp.sh (line 82), qwen3.5_fp8_b300_mtp.sh (line 77), and all DSR1 MTP scripts (dsr1_fp8_b200_mtp.sh, dsr1_fp4_mi355x_atom_mtp.sh, dsr1_fp8_mi355x_atom_mtp.sh, dsr1_fp8_h200_trt_mtp.sh, dsr1_fp4_b200_trt_mtp.sh, dsr1_fp8_b300_mtp.sh). The non-MTP counterpart qwen3.5_bf16_mi355x.sh also lacks the flag, confirming this was inadvertently copied from the non-MTP script without adding the MTP-required flag.

Why existing code doesn't prevent it

There is no automated check enforcing --use-chat-template in MTP scripts. The script passes bash syntax validation (bash -n) without the flag. The error is a logical omission, not a syntax one.
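Such a check could be automated. A hypothetical CI lint sketch, grepping each MTP launch script for the flag; a throwaway demo directory stands in for benchmarks/single_node/, and the two filenames below are made up, not from the repo:

```shell
# Hypothetical lint: report any *_mtp.sh script missing --use-chat-template.
# The demo directory and its two files stand in for benchmarks/single_node/.
demo=$(mktemp -d)
printf 'run_benchmark_serving --use-chat-template\n' > "$demo/good_mtp.sh"
printf 'run_benchmark_serving\n' > "$demo/bad_mtp.sh"
report=""
for f in "$demo"/*_mtp.sh; do
  # '--' stops grep from treating the pattern as an option.
  grep -q -- '--use-chat-template' "$f" || report="missing flag: $(basename "$f")"
done
echo "$report"   # prints: missing flag: bad_mtp.sh
rm -rf "$demo"
```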

What the impact would be

With EAGLE speculative decoding, the draft model proposes tokens conditioned on the training distribution, which includes the chat template's special tokens and formatting. When random raw-token prompts (without chat template structure) are used as inputs, the verifier accepts draft tokens at abnormally high rates because the output distribution is skewed. This inflates the reported MTP acceptance rate and throughput, producing numbers that overstate real-world performance gains from speculative decoding and are directly incomparable to the other Qwen3.5 MTP configs. This exact mechanism was explicitly documented as a bug fix in perf-changelog.yaml for PR #647 (dsr1-fp8-mi355x-sglang-disagg): 'Add --use-chat-template argument to benchmark_serving script. Without this arg, MTP acceptance rates are artificially high for DeepSeek with MTP'.

How to fix it

Add --use-chat-template \ to the run_benchmark_serving call in qwen3.5_bf16_mi355x_mtp.sh, matching the pattern used in all other MTP benchmark scripts.
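A sketch of the fixed invocation: the argument list mirrors the snippet quoted earlier plus the missing flag, while the stub function and the placeholder variable values are assumptions (the real run_benchmark_serving helper lives in the repo):

```shell
# Stub so the argument list can be inspected; not the repo's real helper.
run_benchmark_serving() { printf '%s\n' "$@"; }
# Placeholder values, for illustration only.
MODEL="qwen3.5" PORT=30000 ISL=1024 OSL=1024
RANDOM_RANGE_RATIO=1.0 CONC=8 RESULT_FILENAME="result.json"
args=$(run_benchmark_serving \
  --model "$MODEL" \
  --port "$PORT" \
  --backend vllm \
  --input-len "$ISL" \
  --output-len "$OSL" \
  --random-range-ratio "$RANDOM_RANGE_RATIO" \
  --num-prompts "$((CONC * 10))" \
  --max-concurrency "$CONC" \
  --use-chat-template \
  --result-filename "$RESULT_FILENAME")
echo "$args" | grep -- '--use-chat-template'
```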

Step-by-step proof

  1. The script starts an SGLang server with EAGLE speculative decoding (lines 40–56).
  2. run_benchmark_serving is called at lines 58–70 without --use-chat-template.
  3. Without the flag, the benchmark tool generates prompts sampled from random token IDs, not from the chat-formatted distribution the model was trained on.
  4. The EAGLE draft model has learned to predict tokens that follow chat-template patterns (e.g., <|im_start|>assistant\n...). When the prompt lacks these patterns, the draft model's proposals happen to coincide with what the verifier would generate at an inflated rate — not because speculative decoding is working well, but because the input distribution is anomalous.
  5. The result: reported draft token acceptance rates and throughput figures are higher than they would be with real user inputs, making the benchmark non-representative and not comparable to qwen3.5_fp8_b200_mtp.sh, qwen3.5_fp8_h200_mtp.sh, and qwen3.5_fp8_b300_mtp.sh, which all include --use-chat-template.
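For intuition, here is what a chat-templated prompt looks like versus a raw random-token input. The markers shown follow the standard Qwen ChatML convention (<|im_start|>/<|im_end|>); the user message is made up for illustration:

```shell
# Illustration: the structure the EAGLE draft model saw during training
# (Qwen-style ChatML markers), which raw random-token prompts lack entirely.
prompt=$(cat <<'EOF'
<|im_start|>user
Summarize the attached report.<|im_end|>
<|im_start|>assistant
EOF
)
echo "$prompt"
```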

functionstackx and others added 2 commits April 18, 2026 00:59
Mirrors the existing qwen3.5-bf16-mi355x-sglang non-MTP recipe and adds
EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4)
via the standard spec-decoding=mtp suffix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.5 MTP (EAGLE) benchmarks need the chat template applied so the
client-side prompts match what the model was trained to predict; without
it the spec-decoding quality regresses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the claude/add-qwen3.5-bf16-mi355x-mtp branch from c39ab19 to 55b2975 on April 18, 2026 04:59
@functionstackx functionstackx merged commit dd29308 into main Apr 18, 2026
3 checks passed
@functionstackx functionstackx deleted the claude/add-qwen3.5-bf16-mi355x-mtp branch April 18, 2026 04:59