
[Qwen3.5 MoE] Add _tp_plan to ForConditionalGeneration #45124

Merged
ArthurZucker merged 1 commit into huggingface:main from danielquintas8:add-tp-plan-qwen3-5-moe
Apr 2, 2026

Conversation

Contributor

@danielquintas8 danielquintas8 commented Mar 30, 2026

What does this PR do?

Adds _tp_plan = {"lm_head": "colwise_gather_output"} to Qwen3_5MoeForConditionalGeneration (the VL wrapper class).
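In class form, the change is a single attribute. The sketch below is abridged, and the base class shown is illustrative rather than the one in the actual file:

```python
# Abridged sketch of the change in modular_qwen3_5_moe.py.
# The base class here is illustrative; only `_tp_plan` is new.
from transformers import PreTrainedModel

class Qwen3_5MoeForConditionalGeneration(PreTrainedModel):
    # Shard `lm_head` column-wise and all-gather its output so every rank
    # sees full-vocabulary logits, matching Qwen3_5MoeForCausalLM.
    _tp_plan = {"lm_head": "colwise_gather_output"}
```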

The text-only Qwen3_5MoeForCausalLM already had _tp_plan, but the VL variant was missing it. This meant that when using tp_plan="auto", the lm_head on the VL model was not sharded — each GPU held a full copy and the all-gather behavior (colwise_gather_output) was not applied, which could produce incorrect logits under tensor parallelism.
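For context, this is how the plan gets picked up at load time (a hedged sketch: the checkpoint id is a placeholder, and the script would be launched with one process per GPU, e.g. via torchrun):

```python
# Hedged sketch -- launch with e.g. `torchrun --nproc-per-node 4 check_tp.py`.
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "org/qwen3_5-moe-vl",  # placeholder checkpoint id, for illustration only
    tp_plan="auto",        # resolves _tp_plan / base_model_tp_plan from the model classes
)
# With this PR, model.lm_head is column-sharded with gathered output,
# rather than each rank holding a full replica of the weight.
print(type(model.lm_head.weight))
```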

Change: Applied in modular_qwen3_5_moe.py (source of truth) and regenerated modeling_qwen3_5_moe.py.

Already in place (no changes needed):

  • base_model_tp_plan on Qwen3_5MoeTextConfig covers full attention (q/k/v/o_proj, q/k_norm), MoE experts, and shared experts.
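For reference, such a plan is a dict on the config mapping module-name patterns to sharding strategies; the entries below illustrate the shape only and are not a verbatim copy of the Qwen3.5 plan:

```python
# Illustrative structure of a base_model_tp_plan (representative keys and
# strategies, not the exact Qwen3.5 MoE entries):
base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    # ...plus entries covering q_norm/k_norm, routed experts, and shared experts.
}
```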

Out of scope (future work):

  • Linear attention (GatedDeltaNet) TP — blocked on causal_conv1d DTensor support.
  • Vision block TP — pending path resolution investigation.

Fixes #45125

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@3outeille @ArthurZucker (distributed / model loading)

The VL wrapper class was missing `_tp_plan`, so `lm_head` was not
sharded when using `tp_plan="auto"`. The text-only `ForCausalLM`
already had this; this aligns the conditional-generation (VL) variant.
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen3_5_moe

Collaborator

@ArthurZucker ArthurZucker left a comment


yes ! ty

@ArthurZucker ArthurZucker merged commit 57e8413 into huggingface:main Apr 2, 2026
19 checks passed
marvinzh pushed a commit to marvinzh/transformers that referenced this pull request Apr 3, 2026
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Apr 4, 2026
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Apr 18, 2026


Development

Successfully merging this pull request may close these issues.

Qwen3_5MoeForConditionalGeneration missing _tp_plan for tensor parallelism

2 participants