Fix MoE routers returning probabilities instead of logits #45131
ArthurZucker merged 2 commits into huggingface:main
vasqu left a comment
Careful approval but this is correct in my eyes, we definitely shouldn't have a double softmax imo.
2 things:
- You need to apply this to the actual modeling files and their inheritances via `make fix-repo`
- We need to adjust our test to catch this, i.e. let's add another check in the `test_load_balancing_loss` tests (we should be able to easily check via sum != 1, which should be hard to achieve without a softmax)
Also cc @ArthurZucker since you have the most exp here with the moe loss
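A minimal, dependency-free sketch of the kind of check suggested above, assuming router outputs can be treated as a list of per-token rows. The helper name `looks_like_probabilities` and the toy values are hypothetical, not part of the Transformers test suite; the real check would presumably operate on tensors inside the `test_load_balancing_loss` tests.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def looks_like_probabilities(router_output, tol=1e-4):
    """Hypothetical helper: True if every row is non-negative and sums to ~1."""
    return all(
        min(row) >= 0.0 and abs(sum(row) - 1.0) <= tol
        for row in router_output
    )

# Toy router output: 3 tokens x 4 experts, values chosen arbitrarily.
raw_logits = [
    [2.0, -1.0, 0.5, 0.1],
    [0.3, 0.3, -2.0, 1.7],
    [-0.4, 1.2, 0.9, -1.1],
]
softmaxed = [softmax(row) for row in raw_logits]

# Raw logits should fail the "sums to 1" check; softmaxed rows pass it.
assert not looks_like_probabilities(raw_logits)
assert looks_like_probabilities(softmaxed)
```

Raw logits could in principle sum to 1 by coincidence, but that is unlikely to hold for every row at once, which is why the sum makes a cheap regression signal.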
@yacinemebarki you applied the template a bit wrong and basically pinged every member of the transformers repo. Please be careful about this next time; for now I commented it out again 🤗
Looks like a bad merge/rebase happened 😓
Will likely be superseded by #45346
8c85a85 to f482fa7
[For maintainers] Suggested jobs to run (before merge)
run-slow: flex_olmo, minimax, mixtral, olmoe, qwen2_moe, qwen3_5_moe, qwen3_moe, qwen3_next, qwen3_omni_moe, qwen3_vl_moe
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
…e#45131)
* Fix MoE routers returning probabilities instead of logits
* Propagate modular fix to modeling files via make fix-repo
---------
Co-authored-by: Arthur <arthur.zucker@gmail.com>
What does this PR do?
Fixes issue #45120: Several MoE routers returned softmaxed probabilities as `router_logits`, which caused `load_balancing_loss_func` to compute softmax(softmax(logits)), flattening routing distributions and weakening gradient signals during fine-tuning. This PR fixes it by keeping `router_logits` as raw logits and computing `router_probs` separately for top-k routing.
Code Agent Policy
The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by code agents. We are currently bottlenecked by our ability to review and respond to them. As a result, we ask that new users do not submit pure code agent PRs at this time.
For more information, please read CONTRIBUTING.md.
Before submitting
How to test / verify
- `output_router_logits=True`.
- `router_logits` returned by routers are raw logits, not probabilities.
- `load_balancing_loss_func` receives proper logits and computes meaningful gradients.
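To illustrate why the double softmax matters, here is a small standalone sketch in plain Python; it is not the Transformers implementation. The `route` function is a hypothetical toy router that mirrors the approach described in this PR: it hands back raw logits unchanged and computes probabilities separately for top-k selection. Because probabilities lie in [0, 1], a second softmax can separate the largest and smallest entries by a factor of at most e, so the distribution is pushed toward uniform.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def route(logits, top_k):
    """Toy router (hypothetical): raw logits out, probs used only for top-k."""
    router_probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: router_probs[i], reverse=True)
    top = [(i, router_probs[i]) for i in order[:top_k]]
    return logits, top

# One token, four experts; values chosen arbitrarily for illustration.
logits = [4.0, 1.0, 0.0, -2.0]
once = softmax(logits)   # what the loss should see after its own softmax
twice = softmax(once)    # what it saw when routers returned probabilities

print("max prob, single softmax:", max(once))
print("max prob, double softmax:", max(twice))
print("entropy single / double / uniform:",
      entropy(once), entropy(twice), math.log(len(logits)))

router_logits, top_experts = route(logits, top_k=2)
print("top-2 experts:", top_experts)
```

The double-softmax distribution has a lower peak and higher entropy than the single-softmax one, which is the flattening described in the PR summary.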