fix(granitemoe*): Only create block_sparse_moe if num_local_experts > 0#42036

Merged
Rocketknight1 merged 7 commits into huggingface:main from gabe-l-hart:GraniteMoeAsDenseFix
Nov 20, 2025
Conversation

@gabe-l-hart
Contributor

Branch: GraniteMoeAsDenseFix

What does this PR do?

With the introduction of `modular_granitemoe.py` in #40132, the conditional that allowed GraniteMoe to also encapsulate dense models as a degenerate case was accidentally removed. This is never actually needed for the GraniteMoe architecture directly, but GraniteMoe is reused in GraniteMoeShared and then GraniteMoeHybrid, which do need the ability to encapsulate dense FFN blocks in place of the MoE block.
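As a rough illustration, the restored behavior amounts to the following sketch. The class and attribute names mirror the ones discussed in this PR, but the bodies are stand-ins, not the real transformers implementations:

```python
# Hedged sketch of the degenerate-dense conditional: only build the MoE
# block when at least one expert is configured; otherwise fall back to a
# plain dense FFN. Names are illustrative stand-ins.
class SparseMoeBlock:
    def __init__(self, num_local_experts):
        self.num_local_experts = num_local_experts

class DenseMLP:
    pass

class DecoderLayer:
    def __init__(self, num_local_experts):
        if num_local_experts > 0:
            # MoE case: route tokens through a sparse mixture of experts.
            self.block_sparse_moe = SparseMoeBlock(num_local_experts)
        else:
            # Degenerate dense case: a single dense FFN replaces the MoE block.
            self.mlp = DenseMLP()

moe_layer = DecoderLayer(num_local_experts=8)
dense_layer = DecoderLayer(num_local_experts=0)
```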

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker I believe this came in with your PR for MoE in vLLM, so I'd love your sanity check on this fix.

@gabe-l-hart
Contributor Author

Looks like I need to regenerate the `modeling_*` code everywhere.

@gabe-l-hart
Contributor Author

Interestingly, when I run `make fix-copies`, the `python utils/check_modular_conversion.py --fix_and_overwrite` script makes far more changes than I would expect from adding this conditional. In particular, for each architecture in the GraniteMoe* chain, it adds a `GraniteMoe<qualifier>SparseMoeBlock` class, and then in the `__init__` for `GraniteMoe<qualifier>DecoderLayer` it adds `self.block_sparse_moe = GraniteMoe<qualifier>SparseMoeBlock(config)` (without any conditional) AND the guarded conditional block `self.block_sparse_moe = GraniteMoeHybridMoE(config)`.

Looking a little deeper, this seems to be caused by the inheritance from `modular_mixtral.py`. I've added a clause that explicitly uses `delattr` to remove `self.block_sparse_moe` when it is not used, but that seems a bit backwards. Alternatively, we could stop inheriting from `MixtralDecoderLayer`, or we could move the conditional up into `MixtralDecoderLayer.__init__`.

The part that I'm still confused about is why regenerating the `modeling_*` files adds these `SparseMoeBlock` implementations at all. The inheritance from Mixtral was already there prior to my change, so I would expect those to have already been added, unless somehow making the creation of `self.block_sparse_moe` conditional triggers logic in the generation that requires them to be added?

@gabe-l-hart
Contributor Author

It appears that putting the creation of `self.block_sparse_moe` behind a conditional does, in fact, trigger the inclusion of those `SparseMoeBlock` pieces in the generation. I've used an inline conditional now, which seems to prevent this.
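A hypothetical sketch of the inline-conditional form described above (names are stand-ins, not the real classes): a single assignment statement keeps the modular converter from emitting both an unconditional and a guarded construction of `block_sparse_moe`.

```python
# Stand-in for the real MoE block; only the construction pattern matters here.
class GraniteMoeHybridMoE:
    def __init__(self, num_local_experts):
        self.num_local_experts = num_local_experts

class HybridDecoderLayer:
    def __init__(self, num_local_experts):
        self.has_experts = num_local_experts > 0
        # One statement, one attribute: the MoE block in the expert case,
        # None in the degenerate dense case.
        self.block_sparse_moe = (
            GraniteMoeHybridMoE(num_local_experts) if self.has_experts else None
        )

moe = HybridDecoderLayer(num_local_experts=4)
dense = HybridDecoderLayer(num_local_experts=0)
```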

Comment thread src/transformers/models/granitemoehybrid/modeling_granitemoehybrid.py Outdated
Comment thread src/transformers/models/granitemoe/modeling_granitemoe.py Outdated
@gabe-l-hart
Contributor Author

🤦 Ok, all that was because of a bad copy-paste somewhere that had me creating a completely incorrect block for self.block_sparse_moe. Fixed and cleaned up history.

…erts > 0

Branch: GraniteMoeAsDenseFix

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteMoeAsDenseFix

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@gabe-l-hart
Contributor Author

One more redo: based on advice from @ArthurZucker, since there are no models using either GraniteMoe or GraniteMoeShared with the degenerate dense configuration, it's preferable to only have this conditional override in GraniteMoeHybrid, where it is needed for various flavors of the granite-4.0-* models.
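One possible shape of that arrangement, sketched with simplified stand-in classes (this is an assumption about the structure, not the actual transformers code): the base GraniteMoe layer keeps its unconditional MoE block, and only the Hybrid subclass gains the dense fallback.

```python
# Illustrative sketch: base layer is always MoE; only the Hybrid variant
# overrides construction to allow the degenerate dense configuration.
class GraniteMoeDecoderLayer:
    def __init__(self, num_local_experts):
        # GraniteMoe / GraniteMoeShared: always build the MoE block.
        self.block_sparse_moe = {"num_local_experts": num_local_experts}

class GraniteMoeHybridDecoderLayer(GraniteMoeDecoderLayer):
    def __init__(self, num_local_experts):
        super().__init__(num_local_experts)
        # Hybrid only: drop to None when no experts are configured
        # (e.g. dense flavors in the granite-4.0-* family).
        self.block_sparse_moe = (
            self.block_sparse_moe if num_local_experts > 0 else None
        )

base = GraniteMoeDecoderLayer(num_local_experts=0)
hybrid_dense = GraniteMoeHybridDecoderLayer(num_local_experts=0)
hybrid_moe = GraniteMoeHybridDecoderLayer(num_local_experts=4)
```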

return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)


class GraniteFlashAttentionKwargs(TypedDict, total=False):
Contributor Author


It looks like these just got moved during the regeneration. I'm not sure if they should be included (to enforce consistency with the generation script) or excluded (to minimize the size of the change).

Member


Don't stress about the modeling file - if the modular is correct then `make fix-copies` should handle it all.

@Rocketknight1
Member

Hi @gabe-l-hart, thanks for the PR! You can get the code style tests to pass with `pip install -e .[quality]` followed by `make fixup` or `make style`.

Overall it looks good to me, and it does seem like the zero-experts case was accidentally deleted. Will wait for @ArthurZucker to confirm before merging!

@gabe-l-hart
Contributor Author

@Rocketknight1 Thanks! I'll get it cleaned up and hopefully green today

Branch: GraniteMoeAsDenseFix

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@Rocketknight1
Member

Yep, looks good now! If you don't get core maintainer approval within a few days, ping me and I'll see what I can do

@gabe-l-hart
Contributor Author

@Rocketknight1 @ArthurZucker I lost track of this one. I just resolved the conflict with the GitHub web UI, but I didn't check whether there are other changes that would conflict (i.e. this being solved at a higher level in the class hierarchy). Do we still need this PR to support non-hybrid/non-moe configurations with GraniteMoeHybrid? If so, can we get this looked over?

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: granitemoehybrid

Member

@Rocketknight1 left a comment


Yes, I'm happy to approve now! The change makes sense, and given the existence of has_experts, the 0-expert path is clearly intended but just broken.

@Rocketknight1 enabled auto-merge (squash) November 20, 2025 18:44
@Rocketknight1 merged commit a1afeca into huggingface:main Nov 20, 2025
17 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@gabe-l-hart deleted the GraniteMoeAsDenseFix branch December 1, 2025 18:58
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
… 0 (huggingface#42036)

* fix(granitemoehybid): Only set self.block_sparse_moe if num_local_experts > 0

Branch: GraniteMoeAsDenseFix

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(granitemoehybrid): Regenerate modeling_granitemoehybrid.py

Branch: GraniteMoeAsDenseFix

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Fix import order

Branch: GraniteMoeAsDenseFix

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* make fix-copies

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Matt <rocketknight1@gmail.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>