Add MI355X config: qwen3.5-bf16-sglang-mtp #1077
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work!

PR authors are responsible for ensuring that all GitHub Actions jobs fully pass after merging. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring the re-run passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to the core maintainers over Slack.
```shell
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
```
🔴 The new qwen3.5_bf16_mi355x_mtp.sh script is missing --use-chat-template from its run_benchmark_serving call (lines 62–70), while every other MTP benchmark script in the repository includes this flag. Without it, EAGLE speculative decoding acceptance rates are artificially inflated because random prompts are not formatted as chat messages, making benchmark results not comparable to other Qwen3.5 MTP configs. Add --use-chat-template to the run_benchmark_serving invocation to match the pattern established by qwen3.5_fp8_b200_mtp.sh, qwen3.5_fp8_h200_mtp.sh, qwen3.5_fp8_b300_mtp.sh, and all DSR1 MTP scripts.
Extended reasoning...
What the bug is and how it manifests
The benchmarks/single_node/qwen3.5_bf16_mi355x_mtp.sh script launches SGLang with EAGLE speculative decoding (--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4) but then calls run_benchmark_serving without the --use-chat-template flag. This means the benchmark tool generates raw random token prompts rather than prompts that have been formatted using the model's chat template.
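To make the distinction concrete, here is a minimal illustration (not code from the PR; the prompt text and variable names are invented) of wrapping a raw prompt in the ChatML-style markers that Qwen chat models are trained on. Passing `--use-chat-template` makes the benchmark client apply this kind of formatting instead of sending raw random-token prompts:

```shell
# Hypothetical sketch: wrap a raw prompt in ChatML-style markers
# (<|im_start|>/<|im_end|>). The prompt text is a placeholder; with
# --use-chat-template the benchmark client performs this formatting
# using the model's actual chat template.
raw_prompt="Summarize the plot of Hamlet in one sentence."

chat_prompt="<|im_start|>user
${raw_prompt}<|im_end|>
<|im_start|>assistant
"

printf '%s' "$chat_prompt"
```

Without this structure, the EAGLE draft model never sees the special tokens it was trained to condition on, which is what distorts the acceptance statistics.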
The specific code path that triggers it
In qwen3.5_bf16_mi355x_mtp.sh (lines 62–70), run_benchmark_serving is called with --model, --port, --backend vllm, --input-len, --output-len, --random-range-ratio, --num-prompts, --max-concurrency, --result-filename, and --result-dir — but --use-chat-template is absent. Every other MTP benchmark script in the repo passes this flag: qwen3.5_fp8_b200_mtp.sh (line 91), qwen3.5_fp8_h200_mtp.sh (line 82), qwen3.5_fp8_b300_mtp.sh (line 77), and all DSR1 MTP scripts (dsr1_fp8_b200_mtp.sh, dsr1_fp4_mi355x_atom_mtp.sh, dsr1_fp8_mi355x_atom_mtp.sh, dsr1_fp8_h200_trt_mtp.sh, dsr1_fp4_b200_trt_mtp.sh, dsr1_fp8_b300_mtp.sh). The non-MTP counterpart qwen3.5_bf16_mi355x.sh also lacks the flag, confirming this was inadvertently copied from the non-MTP script without adding the MTP-required flag.
Why existing code doesn't prevent it
There is no automated check enforcing --use-chat-template in MTP scripts. The script passes bash syntax validation (bash -n) without the flag. The error is a logical omission, not a syntax one.
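A lightweight guard would be easy to add. The sketch below (a hypothetical check, not part of the repository) flags any `*_mtp.sh` script that calls `run_benchmark_serving` without the flag:

```shell
# Hypothetical CI lint: every *_mtp.sh benchmark script that calls
# run_benchmark_serving must also pass --use-chat-template.
# Prints each offending file and returns nonzero if any are found.
check_mtp_scripts() {
  local dir="$1" bad=0
  for script in "$dir"/*_mtp.sh; do
    [ -e "$script" ] || continue  # no MTP scripts in this directory
    if grep -q 'run_benchmark_serving' "$script" \
       && ! grep -q -- '--use-chat-template' "$script"; then
      echo "missing --use-chat-template: $script"
      bad=1
    fi
  done
  return "$bad"
}
```

Wired into CI as, say, `check_mtp_scripts benchmarks/single_node`, this would have caught the omission at PR time rather than in benchmark results.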
What the impact would be
With EAGLE speculative decoding, the draft model proposes tokens conditioned on the training distribution, which includes the chat template's special tokens and formatting. When random raw-token prompts (without chat template structure) are used as inputs, the verifier accepts draft tokens at abnormally high rates because the output distribution is skewed. This inflates the reported MTP acceptance rate and throughput, producing numbers that overstate real-world performance gains from speculative decoding and are directly incomparable to the other Qwen3.5 MTP configs. This exact mechanism was explicitly documented as a bug fix in perf-changelog.yaml for PR #647 (dsr1-fp8-mi355x-sglang-disagg): 'Add --use-chat-template argument to benchmark_serving script. Without this arg, MTP acceptance rates are artificially high for DeepSeek with MTP'.
How to fix it
Add `--use-chat-template \` to the run_benchmark_serving call in qwen3.5_bf16_mi355x_mtp.sh, matching the pattern used in all other MTP benchmark scripts.
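Concretely, the corrected invocation would look roughly like the sketch below (arguments reproduced from the diff; the `$RESULT_DIR` variable name is an assumption, and only the final flag is new):

```shell
# Sketch only, not runnable standalone: run_benchmark_serving and the
# environment variables are defined by the benchmark script itself.
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir "$RESULT_DIR" \
    --use-chat-template
```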
Step-by-step proof
- The script starts an SGLang server with EAGLE speculative decoding (lines 40–56).
- `run_benchmark_serving` is called at lines 58–70 without `--use-chat-template`.
- Without the flag, the benchmark tool generates prompts sampled from random token IDs, not from the chat-formatted distribution the model was trained on.
- The EAGLE draft model has learned to predict tokens that follow chat-template patterns (e.g., `<|im_start|>assistant\n...`). When the prompt lacks these patterns, the draft model's proposals happen to coincide with what the verifier would generate at an inflated rate, not because speculative decoding is working well but because the input distribution is anomalous.
- The result: reported draft-token acceptance rates and throughput figures are higher than they would be with real user inputs, making the benchmark non-representative and not comparable to `qwen3.5_fp8_b200_mtp.sh`, `qwen3.5_fp8_h200_mtp.sh`, and `qwen3.5_fp8_b300_mtp.sh`, which all include `--use-chat-template`.
Mirrors the existing qwen3.5-bf16-mi355x-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) via the standard spec-decoding=mtp suffix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
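The speculative-decoding flags named in the commit message map onto the SGLang server launch roughly as follows (a sketch, not the script itself; the model path is a placeholder and the real script sets many more arguments):

```shell
# Sketch of the EAGLE MTP server launch. "$MODEL" is a placeholder;
# the actual script also configures port, parallelism, etc.
python3 -m sglang.launch_server \
    --model-path "$MODEL" \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
```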
Qwen3.5 MTP (EAGLE) benchmarks need the chat template applied so the client-side prompts match what the model was trained to predict; without it the spec-decoding quality regresses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from c39ab19 to 55b2975.
Summary
- `qwen3.5-bf16-mi355x-sglang-mtp` config mirroring the existing `qwen3.5-bf16-mi355x-sglang` non-MTP recipe, plus a new `benchmarks/single_node/qwen3.5_bf16_mi355x_mtp.sh` launch script.
- EAGLE speculative decoding enabled via `--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`.
- Config tagged `spec-decoding: mtp` so the MI355X runner picks up the `_mtp.sh` variant.
- `perf-changelog.yaml` entry added (PR link placeholder; update after merge per AGENTS.md).

Test plan
- `bash -n benchmarks/single_node/qwen3.5_bf16_mi355x_mtp.sh` — bash syntax OK.
- `python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/amd-master.yaml` — emits 14 entries (2 ISL/OSL × 7 concurrencies) with `spec-decoding=mtp`.

🤖 Generated with Claude Code