Add Multi-Token Prediction (MTP) support for Qwen3.5 by curnane-lab · Pull Request #45637 · huggingface/transformers

curnane-lab · 2026-04-24T15:14:11Z

Add Multi-Token Prediction (MTP) support for Qwen3.5

This PR adds the Multi-Token Prediction (MTP) architecture and an MTP loss computation for Qwen3.5 models, so that multi-token prediction can be used during training, with the aim of improving training efficiency.

Changes

New classes:

Qwen3_5MTPLayer: Single MTP transformer layer with attention and MLP
Qwen3_5MTP: Top-level MTP module with FC fusion, layers, and norm
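
To make the "FC fusion, layers, and norm" structure concrete, here is a minimal PyTorch sketch. It is not the code from this PR: the class name `MTPSketch`, the use of `nn.LayerNorm` and `nn.TransformerEncoderLayer` as stand-ins for the Qwen3.5 norm and `Qwen3_5MTPLayer`, the concatenation order, and all sizes are assumptions chosen only so that the submodule names match the checkpoint layout listed further down.

```python
import torch
from torch import nn


class MTPSketch(nn.Module):
    """Hypothetical stand-in for Qwen3_5MTP; submodule names mirror the checkpoint keys."""

    def __init__(self, hidden_size: int = 16, num_layers: int = 1, num_heads: int = 2):
        super().__init__()
        # Separate norms for the backbone hidden states and the token embeddings.
        self.pre_fc_norm_hidden = nn.LayerNorm(hidden_size)
        self.pre_fc_norm_embedding = nn.LayerNorm(hidden_size)
        # "FC fusion": project the concatenated pair back to hidden_size.
        self.fc = nn.Linear(2 * hidden_size, hidden_size, bias=False)
        # Generic transformer layers stand in for Qwen3_5MTPLayer (assumption).
        self.layers = nn.ModuleList(
            [
                nn.TransformerEncoderLayer(
                    hidden_size,
                    num_heads,
                    dim_feedforward=4 * hidden_size,
                    batch_first=True,
                    norm_first=True,
                )
                for _ in range(num_layers)
            ]
        )
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states: torch.Tensor, token_embeds: torch.Tensor) -> torch.Tensor:
        # Concatenation order (embedding first, then hidden) is an assumption.
        fused = self.fc(
            torch.cat(
                [self.pre_fc_norm_embedding(token_embeds), self.pre_fc_norm_hidden(hidden_states)],
                dim=-1,
            )
        )
        # Boolean causal mask: True marks positions that may NOT be attended to.
        seq_len = fused.size(1)
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        for layer in self.layers:
            fused = layer(fused, src_mask=causal_mask)
        return self.norm(fused)
```

The output has the same shape as the backbone hidden states, so in a design like this it could be passed through the existing LM head to obtain MTP logits.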

New shared helper:

_compute_qwen35_mtp_loss(): Shared MTP loss computation function used by both CausalLM and VL models, eliminating code duplication
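
The PR description does not spell out the loss formula, so the following is only a sketch of how a shared MTP loss helper could work. It assumes the MTP head predicts one token further ahead than the main head (label offset 2 instead of 1) and that the total loss is `main_loss + mtp_loss_weight * mtp_loss`. The function names are hypothetical and this is pure Python for illustration, not the actual `_compute_qwen35_mtp_loss()` implementation.

```python
import math
from typing import List, Optional, Sequence

IGNORE_INDEX = -100  # label value conventionally skipped by cross-entropy in transformers


def shift_labels(labels: Sequence[int], offset: int) -> List[int]:
    """Shift labels left by `offset`, padding the tail with IGNORE_INDEX."""
    return list(labels[offset:]) + [IGNORE_INDEX] * min(offset, len(labels))


def cross_entropy(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean cross-entropy over positions whose target is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == IGNORE_INDEX:
            continue
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


def combine_losses(main_loss: float, mtp_loss: Optional[float], mtp_loss_weight: float) -> float:
    """Add the weighted MTP loss to the main loss (assumed combination rule)."""
    if mtp_loss is None or mtp_loss_weight == 0.0:
        return main_loss
    return main_loss + mtp_loss_weight * mtp_loss
```

For example, `shift_labels([10, 11, 12, 13], 2)` gives `[12, 13, -100, -100]`, so the last two positions contribute nothing to the MTP loss.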

Modified models:

Qwen3_5ForCausalLM: Added MTP initialization and loss computation in forward pass
Qwen3_5ForConditionalGeneration: Added MTP initialization and loss computation in forward pass

Configuration:

Added mtp_num_hidden_layers (default: 0) and mtp_loss_weight (default: 0.0) to both Qwen3_5TextConfig and Qwen3_5Config
Removed mtp from _keys_to_ignore_on_load_unexpected in Qwen3_5ForCausalLM so MTP weights are properly loaded from checkpoints
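
As a rough illustration of how these two fields gate the feature, the sketch below uses a plain dataclass rather than the real `Qwen3_5TextConfig`. The class `MTPConfigSketch` and the helper `mtp_enabled` are hypothetical; only the field names and defaults come from the description above.

```python
from dataclasses import dataclass


@dataclass
class MTPConfigSketch:
    """Hypothetical stand-in holding only the two MTP fields and their defaults."""

    mtp_num_hidden_layers: int = 0
    mtp_loss_weight: float = 0.0


def mtp_enabled(config: MTPConfigSketch) -> bool:
    """MTP is off unless at least one MTP layer is configured."""
    return config.mtp_num_hidden_layers > 0


default_config = MTPConfigSketch()
mtp_config = MTPConfigSketch(mtp_num_hidden_layers=1, mtp_loss_weight=0.3)
```

With the defaults, `mtp_enabled(default_config)` is `False`, which corresponds to the unchanged behavior for existing configurations. The value `0.3` is an arbitrary example, not a recommended weight.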

Design decisions

Shared loss function: The _compute_qwen35_mtp_loss() helper eliminates code duplication between the text-only and VL models. Both models delegate to this shared function with their respective embed_tokens and rotary_emb references.
MTP loss stays in model files: Following the pattern of other auxiliary losses in transformers (e.g., MoE router losses), MTP loss is computed within the model's forward pass rather than in a separate trainer class.
Backward compatible: With mtp_num_hidden_layers=0 (default), MTP is disabled and the models behave identically to before.
Checkpoint alignment: The MTP module structure aligns with the Qwen3.5 checkpoint format:
- mtp.pre_fc_norm_hidden.*
- mtp.pre_fc_norm_embedding.*
- mtp.fc.*
- mtp.layers.N.*
- mtp.norm.*
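
One way to check the checkpoint alignment described above is to group the keys of a state dict by the listed prefixes. The helper below is a sketch for that purpose. The function name and the example key suffixes in the usage note (such as `.weight`) are assumptions; only the prefixes come from the list above.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Prefixes taken from the checkpoint layout listed in the PR description.
MTP_PREFIXES = (
    "mtp.pre_fc_norm_hidden.",
    "mtp.pre_fc_norm_embedding.",
    "mtp.fc.",
    "mtp.norm.",
)
MTP_LAYER_PATTERN = re.compile(r"^mtp\.layers\.(\d+)\.")


def group_mtp_keys(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Group MTP checkpoint keys by component; keys outside the layout are skipped."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key in keys:
        layer_match = MTP_LAYER_PATTERN.match(key)
        if layer_match:
            groups[f"mtp.layers.{layer_match.group(1)}"].append(key)
            continue
        for prefix in MTP_PREFIXES:
            if key.startswith(prefix):
                groups[prefix.rstrip(".")].append(key)
                break
    return dict(groups)
```

Calling `group_mtp_keys` on the keys of a loaded state dict and confirming that every expected group is non-empty gives a quick sanity check that the MTP weights were present in the checkpoint.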

Testing

Tested with Qwen3.5-MTP model checkpoints to verify weight loading and loss computation.

Add MTP architecture and loss computation for Qwen3.5 models, enabling multi-token prediction during training for improved efficiency.

Changes:
- Add Qwen3_5MTPLayer and Qwen3_5MTP module classes
- Add shared _compute_qwen35_mtp_loss() helper function
- Add MTP support to Qwen3_5ForCausalLM (text-only model)
- Add MTP support to Qwen3_5ForConditionalGeneration (VL model)
- Add mtp_num_hidden_layers and mtp_loss_weight config fields
- Remove mtp from _keys_to_ignore_on_load_unexpected in CausalLM
- Regenerate modeling_qwen3_5.py and configuration_qwen3_5.py

github-actions · 2026-04-24T15:15:32Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen3_5

curnane-lab closed this Apr 24, 2026

curnane-lab deleted the feature/qwen35-mtp-clean branch April 24, 2026 15:22

evalstate mentioned this pull request Apr 28, 2026

Cumulative defect fixes from recent Transformers PRs evalstate/transformers#41

Open

curnane-lab wants to merge 1 commit into huggingface:main from curnane-lab:feature/qwen35-mtp-clean
