Fix MoE routers returning probabilities instead of logits#45131

Merged
ArthurZucker merged 2 commits into huggingface:main from yacinemebarki:fix-moe-router-logits
Apr 13, 2026

Conversation

@yacinemebarki
Contributor

@yacinemebarki yacinemebarki commented Mar 30, 2026

What does this PR do?

Fixes issue #45120: several MoE routers returned softmaxed probabilities as router_logits, which caused load_balancing_loss_func to compute softmax(softmax(logits)), flattening the routing distributions and weakening gradient signals during fine-tuning. This PR keeps router_logits as raw logits and computes router_probs separately for top-k routing.
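A minimal sketch of the corrected routing pattern (the function and argument names here are illustrative, not the exact modeling code touched by this PR):

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden_states, gate, top_k):
    # Raw, unnormalized scores from the gating layer: (num_tokens, num_experts).
    router_logits = gate(hidden_states)
    # Probabilities are computed separately and used only for top-k selection
    # and expert weighting; they are never returned as "router_logits".
    router_probs = F.softmax(router_logits, dim=-1, dtype=torch.float)
    routing_weights, selected_experts = torch.topk(router_probs, top_k, dim=-1)
    routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)
    # Returning the raw logits means load_balancing_loss_func applies softmax exactly once.
    return routing_weights, selected_experts, router_logits
```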


Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by code agents. We are currently bottlenecked by our ability to review and respond to them. As a result, we ask that new users do not submit pure code agent PRs at this time.

  • I confirm that this is not a pure code agent PR.

For more information, please read CONTRIBUTING.md.


Before submitting

  • This PR fixes the issue and does not break inference
  • Did you read the contributor guideline?
  • Documentation does not need changes, as behavior/API is unchanged
  • No new tests are required; logic change only affects auxiliary loss

How to test / verify

  1. Train a model using output_router_logits=True.
  2. Check that router_logits returned by routers are raw logits, not probabilities.
  3. Verify that load_balancing_loss_func receives proper logits and computes meaningful gradients.
  4. Confirm that inference behavior is unchanged.
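A hedged verification sketch for steps 2 and 3 (the checkpoint name is a placeholder; any of the affected MoE models should behave the same way):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute any MoE model touched by this PR.
ckpt = "mistralai/Mixtral-8x7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(ckpt)
tok = AutoTokenizer.from_pretrained(ckpt)

inputs = tok("hello world", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

layer0 = out.router_logits[0].float()  # (num_tokens, num_experts)

# Raw logits essentially never sum to 1 per token; probabilities always do.
raw_sums = layer0.sum(dim=-1)
assert not torch.allclose(raw_sums, torch.ones_like(raw_sums), atol=1e-3)

# Applying softmax exactly once yields a proper distribution over experts.
probs = torch.softmax(layer0, dim=-1)
assert torch.allclose(probs.sum(dim=-1), torch.ones_like(raw_sums))
```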

Contributor

@vasqu vasqu left a comment

Careful approval, but this is correct in my eyes; we definitely shouldn't have a double softmax imo.

2 things

  1. You need to apply this to the actual modeling files and their inheritances via make fix-repo
  2. We need to adjust our tests to catch this, i.e. let's add another check to the test_load_balancing_loss tests; we can easily check that the per-token sums are != 1, since summing to 1 is hard to achieve without a softmax (a sketch follows this comment)

Also cc @ArthurZucker since you have the most exp here with the moe loss
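A hedged sketch of the extra assertion suggested in point 2 (the helper name and how it hooks into the test_load_balancing_loss tests are assumptions; only the sum != 1 idea comes from the comment above):

```python
import torch

def assert_router_logits_are_raw(router_logits, atol=1e-3):
    # If a router accidentally returned softmaxed probabilities, every token's
    # scores would sum to 1; raw logits essentially never do.
    for layer_logits in router_logits:
        sums = layer_logits.float().sum(dim=-1)
        assert not torch.allclose(sums, torch.ones_like(sums), atol=atol), (
            "router_logits rows sum to 1; they look like probabilities, not raw logits"
        )
```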

@vasqu
Contributor

vasqu commented Mar 31, 2026

@yacinemebarki you applied the template a bit incorrectly and basically pinged every member of the transformers repo. Please be careful about this next time; for now I've commented it out again 🤗

@vasqu
Contributor

vasqu commented Apr 1, 2026

Looks like a bad merge/rebase happened 😓

@vasqu
Contributor

vasqu commented Apr 9, 2026

Will likely be superseded by #45346

@ArthurZucker ArthurZucker force-pushed the fix-moe-router-logits branch from 8c85a85 to f482fa7 on April 13, 2026 09:04
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: flex_olmo, minimax, mixtral, olmoe, qwen2_moe, qwen3_5_moe, qwen3_moe, qwen3_next, qwen3_omni_moe, qwen3_vl_moe

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker ArthurZucker left a comment

let's go

@ArthurZucker ArthurZucker added this pull request to the merge queue Apr 13, 2026
Merged via the queue into huggingface:main with commit d294169 Apr 13, 2026
22 checks passed
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Apr 18, 2026
Fix MoE routers returning probabilities instead of logits (huggingface#45131)

* Fix MoE routers returning probabilities instead of logits

* Propagate modular fix to modeling files via make fix-repo

---------

Co-authored-by: Arthur <arthur.zucker@gmail.com>