Fix Mixtral aux_loss not computed when output_router_logits=False #44586
mvanhorn wants to merge 1 commit into huggingface:main from
Conversation
Decouple router logits collection from output visibility.

When `router_aux_loss_coef > 0`, always collect router logits internally to compute `aux_loss` during training, regardless of the `output_router_logits` setting. Only include `router_logits` in the model output when `output_router_logits=True`.

This fix propagates to all MoE models inheriting from Mixtral via modular conversion (ernie4_5_moe, flex_olmo, gpt_oss, jamba, minimax, olmoe, phimoe, qwen2_moe, qwen3_5_moe, qwen3_next).

Fixes huggingface#44242

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
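A minimal sketch of the decoupling described above, assuming a helper like the one below (function and variable names are illustrative, not the actual transformers implementation):

```python
# Sketch of the decoupled policy: collection of router logits is driven by
# router_aux_loss_coef, while output_router_logits only controls visibility.
def router_logits_policy(output_router_logits: bool,
                         router_aux_loss_coef: float,
                         training: bool) -> tuple[bool, bool]:
    """Return (collect_internally, expose_in_output)."""
    # Collect whenever the aux loss needs them, even if the caller
    # did not ask for router logits in the output.
    collect = output_router_logits or (router_aux_loss_coef > 0 and training)
    # Only surface them to the caller when explicitly requested.
    expose = output_router_logits
    return collect, expose

# Example: training with aux loss enabled but output_router_logits=False
assert router_logits_policy(False, 0.02, True) == (True, False)
```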
[For maintainers] Suggested jobs to run (before merge) run-slow: ernie4_5_moe, flex_olmo, gpt_oss, jamba, minimax, minimax_m2, mixtral, olmoe, phimoe, qwen2_moe, qwen3_5_moe, qwen3_next
cc @vasqu as well - I see so many issues and PRs about this and it would be great to finally resolve it
It's a docs issue atp, e.g. see #44242 (comment), this seems like code agent slop 👀 #44264 would be a serious solution but it's a draft for a reason
Sorry for the ping in that case! @mvanhorn be careful opening a lot of code agent PRs - although some of them do turn out to be helpful, the chance of slop causing confusion and wasting time is pretty high.
Fair point - I got a little excited about landing my first PR and jumped to a code change without reading the issue thread carefully enough. The discussion already converged on this being a docs clarification, and @vasqu has a broader approach in #44264. Point taken on the volume, @Rocketknight1 - I'll be more selective going forward and make sure I understand the maintainer consensus before submitting. Just trying to add value to the team here.
What does this PR do?
Decouples router logits collection from output visibility in Mixtral's ForCausalLM. Previously, `output_router_logits=False` (the default) prevented `aux_loss` from being computed, meaning load balancing was silently disabled during training even when `router_aux_loss_coef > 0`.

The fix:

- Always collect router logits internally when `router_aux_loss_coef > 0`
- Compute `aux_loss` when router logits are available
- Only include `router_logits` in the model output when the user explicitly sets `output_router_logits=True` (see the sketch after this list)

This affects all MoE models inheriting from Mixtral via modular conversion: ernie4_5_moe, flex_olmo, gpt_oss, jamba, minimax, minimax_m2, olmoe, phimoe, qwen2_moe, qwen3_5_moe, qwen3_next.
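A quick way to exercise the intended behavior (a minimal sketch; the tiny config values are arbitrary placeholders, and it assumes this PR's change is applied):

```python
# Illustrative check: with the fix, aux_loss is computed during training even
# though output_router_logits is left at its default of False.
import torch
from transformers import MixtralConfig, MixtralForCausalLM

config = MixtralConfig(
    vocab_size=128, hidden_size=64, intermediate_size=128,
    num_hidden_layers=2, num_attention_heads=4, num_key_value_heads=2,
    num_local_experts=4, num_experts_per_tok=2,
    router_aux_loss_coef=0.02,  # load balancing enabled
)
model = MixtralForCausalLM(config).train()
input_ids = torch.randint(0, config.vocab_size, (1, 16))

out = model(input_ids, labels=input_ids)  # output_router_logits not set
print(out.aux_loss)       # expected: a tensor, not None, with this fix
print(out.router_logits)  # expected: None, since it was not requested
```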
Fixes #44242
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. Load balancing loss not added when output_router_logits=False #44242
Who can review?
@SunMarc @ArthurZucker @Cyrilvallez (MoE models, training)
This contribution was developed with AI assistance (Claude Code).