
Fix Mixtral aux_loss not computed when output_router_logits=False #44586

Closed

mvanhorn wants to merge 1 commit into huggingface:main from mvanhorn:osc/44242-fix-mixtral-aux-loss


Conversation

@mvanhorn
Contributor

What does this PR do?

Decouples router-logits collection from output visibility in MixtralForCausalLM. Previously, output_router_logits=False (the default) prevented aux_loss from being computed, meaning load balancing was silently disabled during training even when router_aux_loss_coef > 0.
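A minimal sketch of the pre-fix symptom, using a tiny randomly initialized config; the hyperparameter values below are illustrative, not taken from this PR:

```python
import torch
from transformers import MixtralConfig, MixtralForCausalLM

# Tiny config so the repro runs anywhere; only router_aux_loss_coef matters here.
config = MixtralConfig(
    hidden_size=64, intermediate_size=128, num_hidden_layers=2,
    num_attention_heads=4, num_key_value_heads=2,
    num_local_experts=4, num_experts_per_tok=2,
    router_aux_loss_coef=0.02,  # > 0, so the user expects load balancing
)
model = MixtralForCausalLM(config)
input_ids = torch.randint(0, config.vocab_size, (1, 16))

# output_router_logits defaults to False, so before this PR no router logits
# are collected and the load-balancing term never enters the training loss.
out = model(input_ids=input_ids, labels=input_ids)
print(out.aux_loss)  # None before the fix
```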

The fix (see the sketch after this list):

  • Always collect router logits internally when router_aux_loss_coef > 0
  • Always compute aux_loss when router logits are available
  • Only include router_logits in the model output when the user explicitly sets output_router_logits=True
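
A minimal sketch of that control flow, assuming transformers' load_balancing_loss_func keeps its usual (gate_logits, num_experts, top_k, attention_mask) signature; the helper function and its name are illustrative, not the actual diff:

```python
from transformers.models.mixtral.modeling_mixtral import load_balancing_loss_func

def aux_loss_and_visible_logits(router_logits, output_router_logits,
                                router_aux_loss_coef, num_experts, top_k):
    """Decouple aux_loss computation from router-logits visibility."""
    aux_loss = None
    if router_logits is not None and router_aux_loss_coef > 0:
        # Compute the load-balancing term whenever logits were collected,
        # independent of whether the caller wants to see them.
        aux_loss = load_balancing_loss_func(router_logits, num_experts, top_k)
    # Surface the raw logits only on explicit request.
    visible_logits = router_logits if output_router_logits else None
    return aux_loss, visible_logits
```

Keyed this way, training always gets the load-balancing term whenever the coefficient is positive, while the returned output is unchanged for callers that never asked for router logits.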

This affects all MoE models inheriting from Mixtral via modular conversion: ernie4_5_moe, flex_olmo, gpt_oss, jamba, minimax, minimax_m2, olmoe, phimoe, qwen2_moe, qwen3_5_moe, qwen3_next.

Fixes #44242


Who can review?

@SunMarc @ArthurZucker @Cyrilvallez (MoE models, training)

This contribution was developed with AI assistance (Claude Code).

Decouple router logits collection from output visibility. When
router_aux_loss_coef > 0, always collect router logits internally
to compute aux_loss during training, regardless of the
output_router_logits setting. Only include router_logits in the
model output when output_router_logits=True.

This fix propagates to all MoE models inheriting from Mixtral via
modular conversion (ernie4_5_moe, flex_olmo, gpt_oss, jamba,
minimax, olmoe, phimoe, qwen2_moe, qwen3_5_moe, qwen3_next).

Fixes huggingface#44242

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: ernie4_5_moe, flex_olmo, gpt_oss, jamba, minimax, minimax_m2, mixtral, olmoe, phimoe, qwen2_moe, qwen3_5_moe, qwen3_next

@Rocketknight1
Member

cc @vasqu as well - I see so many issues and PRs about this and it would be great to finally resolve it

@vasqu
Contributor

vasqu commented Mar 11, 2026

It's a docs issue at this point, e.g. see #44242 (comment); this seems like code agent slop 👀

#44264 would be a serious solution but it's a draft for a reason

@Rocketknight1
Member

Sorry for the ping in that case! @mvanhorn, be careful about opening a lot of code agent PRs: although some of them do turn out to be helpful, the chance of slop causing confusion and wasting time is pretty high.

@mvanhorn
Contributor Author

Fair point - I got a little excited about landing my first PR and jumped to a code change without reading the issue thread carefully enough. The discussion already converged on this being a docs clarification, and @vasqu has a broader approach in #44264.

@Rocketknight1, noted on the volume: I'll be more selective going forward and make sure I understand the maintainer consensus before submitting. Just trying to add value to the team here.

