fix(granitemoe*): Only create block_sparse_moe if num_local_experts > 0#42036

Merged
Rocketknight1 merged 7 commits into huggingface:main from gabe-l-hart:GraniteMoeAsDenseFix
Nov 20, 2025
Conversation

@gabe-l-hart
Contributor

Branch: GraniteMoeAsDenseFix

What does this PR do?

With the introduction of `modular_granitemoe.py` in #40132, the conditional that allowed GraniteMoe to also encapsulate dense models as a degenerate case was accidentally removed. This is never actually needed for the GraniteMoe architecture directly, but GraniteMoe is reused in GraniteMoeShared and then GraniteMoeHybrid, which do need the ability to encapsulate dense FFN blocks in place of the MoE block.
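As a rough illustration, the restored behavior amounts to the following sketch. The class and attribute names mirror the ones discussed in this PR, but the bodies are stand-ins, not the real transformers implementations:

```python
# Hedged sketch of the degenerate-dense conditional: only build the MoE
# block when at least one expert is configured; otherwise fall back to a
# plain dense FFN. Names are illustrative stand-ins.
class SparseMoeBlock:
    def __init__(self, num_local_experts):
        self.num_local_experts = num_local_experts

class DenseMLP:
    pass

class DecoderLayer:
    def __init__(self, num_local_experts):
        if num_local_experts > 0:
            # MoE case: route tokens through a sparse mixture of experts.
            self.block_sparse_moe = SparseMoeBlock(num_local_experts)
        else:
            # Degenerate dense case: a single dense FFN replaces the MoE block.
            self.mlp = DenseMLP()

moe_layer = DecoderLayer(num_local_experts=8)
dense_layer = DecoderLayer(num_local_experts=0)
```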

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker I believe this came in with your PR for MoE in vLLM, so I'd love your sanity check on this fix.

@gabe-l-hart
Contributor Author

Looks like I need to regenerate the `modeling_*` code everywhere.

@gabe-l-hart
Contributor Author

Interestingly, when I run `make fix-copies`, the `python utils/check_modular_conversion.py --fix_and_overwrite` script makes far more changes than I would expect from adding this conditional. In particular, for each architecture in the GraniteMoe* chain, it adds a `GraniteMoe<qualifier>SparseMoeBlock` class, and then in the `__init__` for `GraniteMoe<qualifier>DecoderLayer` it adds `self.block_sparse_moe = GraniteMoe<qualifier>SparseMoeBlock(config)` (without any conditional) AND the guarded conditional block `self.block_sparse_moe = GraniteMoeHybridMoE(config)`.

Looking a little deeper, this seems to be caused by the inheritance from `modular_mixtral.py`. I've added a clause that explicitly uses `delattr` to remove `self.block_sparse_moe` when it is not used, but that seems a bit backwards. Alternatively, we could stop inheriting from `MixtralDecoderLayer`, or we could move the conditional up into `MixtralDecoderLayer.__init__`.

The part that I'm still confused about is why regenerating the `modeling_*` files adds these `SparseMoeBlock` implementations at all. The inheritance from Mixtral was already there prior to my change, so I would expect those to have already been added, unless somehow making the creation of `self.block_sparse_moe` conditional triggers logic in the generation that requires them to be added?

@gabe-l-hart
Contributor Author

It appears that putting the creation of `self.block_sparse_moe` behind a conditional does, in fact, trigger the inclusion of those `SparseMoeBlock` pieces in the generation. I've used an inline conditional now, which seems to prevent this.
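A hypothetical sketch of the inline-conditional form described above (names are stand-ins, not the real classes): a single assignment statement keeps the modular converter from emitting both an unconditional and a guarded construction of `block_sparse_moe`.

```python
# Stand-in for the real MoE block; only the construction pattern matters here.
class GraniteMoeHybridMoE:
    def __init__(self, num_local_experts):
        self.num_local_experts = num_local_experts

class HybridDecoderLayer:
    def __init__(self, num_local_experts):
        self.has_experts = num_local_experts > 0
        # One statement, one attribute: the MoE block in the expert case,
        # None in the degenerate dense case.
        self.block_sparse_moe = (
            GraniteMoeHybridMoE(num_local_experts) if self.has_experts else None
        )

moe = HybridDecoderLayer(num_local_experts=4)
dense = HybridDecoderLayer(num_local_experts=0)
```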

Comment thread src/transformers/models/granitemoehybrid/modeling_granitemoehybrid.py Outdated
Comment thread src/transformers/models/granitemoe/modeling_granitemoe.py Outdated
@gabe-l-hart
Contributor Author

🤦 Ok, all that was because of a bad copy-paste somewhere that had me creating a completely incorrect block for self.block_sparse_moe. Fixed and cleaned up history.

…erts > 0

Branch: GraniteMoeAsDenseFix

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteMoeAsDenseFix

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@gabe-l-hart
Contributor Author

One more redo: based on advice from @ArthurZucker, since there are no models using either GraniteMoe or GraniteMoeShared with the degenerate dense configuration, it's preferable to only have this conditional override in GraniteMoeHybrid, where it is needed for various flavors of the granite-4.0-* models.
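One possible shape of that arrangement, sketched with simplified stand-in classes (this is an assumption about the structure, not the actual transformers code): the base GraniteMoe layer keeps its unconditional MoE block, and only the Hybrid subclass gains the dense fallback.

```python
# Illustrative sketch: base layer is always MoE; only the Hybrid variant
# overrides construction to allow the degenerate dense configuration.
class GraniteMoeDecoderLayer:
    def __init__(self, num_local_experts):
        # GraniteMoe / GraniteMoeShared: always build the MoE block.
        self.block_sparse_moe = {"num_local_experts": num_local_experts}

class GraniteMoeHybridDecoderLayer(GraniteMoeDecoderLayer):
    def __init__(self, num_local_experts):
        super().__init__(num_local_experts)
        # Hybrid only: drop to None when no experts are configured
        # (e.g. dense flavors in the granite-4.0-* family).
        self.block_sparse_moe = (
            self.block_sparse_moe if num_local_experts > 0 else None
        )

base = GraniteMoeDecoderLayer(num_local_experts=0)
hybrid_dense = GraniteMoeHybridDecoderLayer(num_local_experts=0)
hybrid_moe = GraniteMoeHybridDecoderLayer(num_local_experts=4)
```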

return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)


class GraniteFlashAttentionKwargs(TypedDict, total=False):
Contributor Author


It looks like these just got moved during the regeneration. I'm not sure if they should be included (to enforce consistency with the generation script) or excluded (to minimize the size of the change).

Member


Don't stress about the modeling file - if the modular is correct then `make fix-copies` should handle it all.

@Rocketknight1
Member

Hi @gabe-l-hart, thanks for the PR! You can get the code style tests to pass with `pip install -e .[quality]` followed by `make fixup` or `make style`.

Overall it looks good to me, and it does seem like the zero-experts case was accidentally deleted. Will wait for @ArthurZucker to confirm before merging!

@gabe-l-hart
Contributor Author

@Rocketknight1 Thanks! I'll get it cleaned up and hopefully green today

Branch: GraniteMoeAsDenseFix

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@Rocketknight1
Member

Yep, looks good now! If you don't get core maintainer approval within a few days, ping me and I'll see what I can do

@gabe-l-hart
Contributor Author

@Rocketknight1 @ArthurZucker I lost track of this one. I just resolved the conflict with the GitHub web UI, but I didn't check whether there are other changes that would conflict (i.e. this being solved at a higher level in the class hierarchy). Do we still need this PR to support non-hybrid/non-moe configurations with GraniteMoeHybrid? If so, can we get this looked over?

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: granitemoehybrid

Member

@Rocketknight1 left a comment


Yes, I'm happy to approve now! The change makes sense, and given the existence of has_experts, the 0-expert path is clearly intended but just broken.

@Rocketknight1 enabled auto-merge (squash) November 20, 2025 18:44
@Rocketknight1 merged commit a1afeca into huggingface:main Nov 20, 2025
17 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@gabe-l-hart deleted the GraniteMoeAsDenseFix branch December 1, 2025 18:58
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
… 0 (huggingface#42036)

* fix(granitemoehybid): Only set self.block_sparse_moe if num_local_experts > 0

Branch: GraniteMoeAsDenseFix

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(granitemoehybrid): Regenerate modeling_granitemoehybrid.py

Branch: GraniteMoeAsDenseFix

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Fix import order

Branch: GraniteMoeAsDenseFix

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* make fix-copies

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Matt <rocketknight1@gmail.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>