Fix softmaxing router logits #45315

Closed
Rocketknight1 wants to merge 1 commit into main from fix/double-softmax-moe-router

@Rocketknight1
Member

Reusing a variable name meant that some MoE routers returned a softmaxed value instead of the original logits. This generally did not affect inference, but it could affect the auxiliary loss on MoE logits during training when the coefficient for that loss was > 0. First reported in #43542. Fixes #45120.
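
For illustration, here is a minimal, stdlib-only sketch of the pattern described above. It is not the actual transformers code: `buggy_router`, `fixed_router`, and the example logits are hypothetical, and the final `softmax` call only stands in for an auxiliary loss that normalizes whatever the router returns.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def buggy_router(logits, k):
    # Hypothetical router: the name `router_logits` is reused for the
    # softmax output, so the caller receives probabilities, not logits.
    router_logits = logits
    router_logits = softmax(router_logits)
    return router_logits, top_k_indices(router_logits, k)


def fixed_router(logits, k):
    # Same routing decision, but the probabilities get their own name and
    # the raw logits are returned untouched.
    router_logits = logits
    router_probs = softmax(router_logits)
    return router_logits, top_k_indices(router_probs, k)


logits = [4.0, 1.0, 0.0, -1.0]  # made-up values for one token, four experts
buggy_out, buggy_top = buggy_router(logits, 2)
fixed_out, fixed_top = fixed_router(logits, 2)

# Stand-in for an auxiliary loss that applies softmax to the router output.
probs_double = softmax(buggy_out)  # softmax applied twice
probs_single = softmax(fixed_out)  # softmax applied once, as intended

print(buggy_top == fixed_top)                 # expert selection is identical
print(max(probs_double) < max(probs_single))  # double softmax flattens the distribution
```

Because softmax preserves ordering, the selected experts are the same in both versions, which is consistent with inference being largely unaffected. The distribution seen by the loss is much flatter in the buggy version, which is why the problem only shows up when the auxiliary loss is actually used.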

Rocketknight1 marked this pull request as ready for review April 8, 2026 12:54
@Rocketknight1
Member Author

(Doing the fix in a separate PR because the existing PRs were stale or had rebase issues)

@Rocketknight1
Member Author

cc @vasqu - this has popped up in multiple issues since January so it'd be good to finally put it to rest!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Rocketknight1 force-pushed the fix/double-softmax-moe-router branch from 33dd54b to b1c8150 April 8, 2026 15:03
@github-actions
Contributor

github-actions Bot commented Apr 8, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: flex_olmo, minimax, mixtral, olmoe, qwen2_moe, qwen3_5_moe, qwen3_moe, qwen3_next, qwen3_omni_moe, qwen3_vl_moe

@vasqu
Contributor

vasqu commented Apr 9, 2026

Answered on the other PR, but replying here just for viz. That really has become messy 😬 sorry

@Rocketknight1
Member Author

Yep, closing in favour of the other PR to keep things neater!

Development

Successfully merging this pull request may close these issues.

Double softmax in MoE router load-balancing loss (mixtral, qwen2_moe, qwen3_vl_moe families)

3 participants