
[Qwen3.5 MoE] Add _tp_plan to ForConditionalGeneration #45124

Merged
ArthurZucker merged 1 commit into huggingface:main from danielquintas8:add-tp-plan-qwen3-5-moe
Apr 2, 2026

Conversation

Contributor

@danielquintas8 danielquintas8 commented Mar 30, 2026

What does this PR do?

Adds _tp_plan = {"lm_head": "colwise_gather_output"} to Qwen3_5MoeForConditionalGeneration (the VL wrapper class).
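In class form, the change is a single attribute. The sketch below is abridged, and the base class shown is illustrative rather than the one in the actual file:

```python
# Abridged sketch of the change in modular_qwen3_5_moe.py.
# The base class here is illustrative; only `_tp_plan` is new.
from transformers import PreTrainedModel

class Qwen3_5MoeForConditionalGeneration(PreTrainedModel):
    # Shard `lm_head` column-wise and all-gather its output so every rank
    # sees full-vocabulary logits, matching Qwen3_5MoeForCausalLM.
    _tp_plan = {"lm_head": "colwise_gather_output"}
```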

The text-only Qwen3_5MoeForCausalLM already had _tp_plan, but the VL variant was missing it. This meant that when using tp_plan="auto", the lm_head on the VL model was not sharded — each GPU held a full copy and the all-gather behavior (colwise_gather_output) was not applied, which could produce incorrect logits under tensor parallelism.
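For context, this is how the plan gets picked up at load time (a hedged sketch: the checkpoint id is a placeholder, and the script would be launched with one process per GPU, e.g. via torchrun):

```python
# Hedged sketch -- launch with e.g. `torchrun --nproc-per-node 4 check_tp.py`.
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "org/qwen3_5-moe-vl",  # placeholder checkpoint id, for illustration only
    tp_plan="auto",        # resolves _tp_plan / base_model_tp_plan from the model classes
)
# With this PR, model.lm_head is column-sharded with gathered output,
# rather than each rank holding a full replica of the weight.
print(type(model.lm_head.weight))
```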

Change: Applied in modular_qwen3_5_moe.py (source of truth) and regenerated modeling_qwen3_5_moe.py.

Already in place (no changes needed):

  • base_model_tp_plan on Qwen3_5MoeTextConfig covers full attention (q/k/v/o_proj, q/k_norm), MoE experts, and shared experts.
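For reference, such a plan is a dict on the config mapping module-name patterns to sharding strategies; the entries below illustrate the shape only and are not a verbatim copy of the Qwen3.5 plan:

```python
# Illustrative structure of a base_model_tp_plan (representative keys and
# strategies, not the exact Qwen3.5 MoE entries):
base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    # ...plus entries covering q_norm/k_norm, routed experts, and shared experts.
}
```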

Out of scope (future work):

  • Linear attention (GatedDeltaNet) TP — blocked on causal_conv1d DTensor support.
  • Vision block TP — pending path resolution investigation.

Fixes #45125

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@3outeille @ArthurZucker (distributed / model loading)

The VL wrapper class was missing `_tp_plan`, so `lm_head` was not
sharded when using `tp_plan="auto"`. The text-only `ForCausalLM`
already had this; this aligns the conditional-generation (VL) variant.
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen3_5_moe

Collaborator

@ArthurZucker ArthurZucker left a comment


yes ! ty

@ArthurZucker ArthurZucker merged commit 57e8413 into huggingface:main Apr 2, 2026
19 checks passed
marvinzh pushed a commit to marvinzh/transformers that referenced this pull request Apr 3, 2026
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Apr 4, 2026
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Apr 18, 2026


Development

Successfully merging this pull request may close these issues.

Qwen3_5MoeForConditionalGeneration missing _tp_plan for tensor parallelism

2 participants