Fix GraniteMoeHybrid _update_mamba_mask crash on attention-only models #45514
Conversation
When all layers are attention layers (no mamba layers), _update_mamba_mask calls past_key_values.has_previous_state(), which tries to find a LinearAttentionCacheLayerMixin layer. Since none exist, it raises ValueError.

Skip the has_previous_state check entirely when the model has no mamba layers, as the mamba mask optimization is irrelevant in that case.

Fixes huggingface#45507
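For context, a minimal sketch of the helper with the guard described above; the signature and the exact condition for dropping the mamba mask are assumptions for illustration, not the upstream diff:

```python
import torch

# Minimal sketch of GraniteMoeHybridModel._update_mamba_mask with the guard
# described above (illustrative only, not the exact upstream code).
def _update_mamba_mask(self, attention_mask, cache_position, past_key_values):
    if "mamba" not in self.config.layers_block_type:
        # Attention-only model: there is no mamba layer to mask, and
        # past_key_values.has_previous_state() would raise ValueError because
        # the cache holds no LinearAttentionCacheLayerMixin layer.
        return attention_mask

    mamba_mask = attention_mask
    # Assumption: the mask can be dropped when a previous state exists or when
    # it is all ones, since the mamba path does not need it in those cases.
    if (past_key_values is not None and past_key_values.has_previous_state()) or (
        attention_mask is not None and torch.all(attention_mask == 1)
    ):
        mamba_mask = None
    return mamba_mask
```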
cc @vasqu maybe since you volunteered!
vasqu
left a comment
Please see my comment and also add a small test for granite moe hybrid in this case
Instead of guarding inside _update_mamba_mask, skip the call entirely in forward() when no mamba layers exist. This keeps _update_mamba_mask focused on its original responsibility and avoids calling it on attention-only models altogether.

Signed-off-by: root <root@hk760245497450.local>
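Roughly, the call-site guard in forward() would read as follows (hand-written illustration; the surrounding variable names are assumptions, not the actual diff):

```python
# Inside GraniteMoeHybridModel.forward(), before the decoder layer loop
# (illustration only; variable names are assumptions).
has_mamba_layers = "mamba" in self.config.layers_block_type
mamba_mask = (
    self._update_mamba_mask(attention_mask, cache_position, past_key_values)
    if has_mamba_layers
    else None
)
# Attention-only models never read mamba_mask, so None is safe here and
# _update_mamba_mask keeps its original, mamba-specific responsibility.
```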
Thanks for the suggestion @vasqu, you're right that the guard belongs at the call site rather than inside _update_mamba_mask. Updated in ec7cd01: the check is now in forward().
Let's add a fast test as well please; should be easy by forcing the layer types on construction of the dummy model
Verifies that GraniteMoeHybrid models with all attention layers (no mamba layers) can run forward without crashing. Regression test for huggingface#45507.

Signed-off-by: root <root@hk760245497450.local>
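A possible shape for that fast test, forcing attention-only layer types on a tiny dummy config; GraniteMoeHybridConfig, GraniteMoeHybridModel, and layers_block_type come from the discussion above, while the remaining constructor arguments are assumptions and may need adjusting:

```python
import torch
from transformers import GraniteMoeHybridConfig, GraniteMoeHybridModel


def test_attention_only_model_forward():
    # Force every layer to be an attention layer so no mamba cache is needed.
    # The tiny sizes below are placeholders for a fast test.
    config = GraniteMoeHybridConfig(
        num_hidden_layers=2,
        layers_block_type=["attention"] * 2,
        hidden_size=32,
        num_attention_heads=4,
        num_key_value_heads=2,
    )
    model = GraniteMoeHybridModel(config).eval()

    input_ids = torch.randint(0, config.vocab_size, (1, 8))
    with torch.no_grad():
        # Before the fix this raised ValueError while looking for a
        # LinearAttentionCacheLayerMixin layer in the cache.
        output = model(input_ids)

    assert output.last_hidden_state.shape == (1, 8, config.hidden_size)
```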
[For maintainers] Suggested jobs to run (before merge)

run-slow: granitemoehybrid
vasqu
left a comment
Thanks, I only changed the logic in the model a bit to be closer to hybrid attention patterns - checking with run-slow in a sec; if everything passes / doesn't change, I'll merge
run-slow: granitemoehybrid
This comment contains models: ["models/granitemoehybrid"]
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Fix GraniteMoeHybrid _update_mamba_mask crash on attention-only models (#45514)

* Fix GraniteMoeHybrid _update_mamba_mask crash on attention-only models

  When all layers are attention layers (no mamba layers), _update_mamba_mask calls past_key_values.has_previous_state() which tries to find a LinearAttentionCacheLayerMixin layer. Since none exist, it raises ValueError. Skip the has_previous_state check entirely when the model has no mamba layers, as the mamba mask optimization is irrelevant in that case. Fixes #45507

* Move mamba layer guard to forward() caller per review feedback

  Instead of guarding inside _update_mamba_mask, skip the call entirely in forward() when no mamba layers exist. This keeps _update_mamba_mask focused on its original responsibility and avoids calling it on attention-only models altogether.

  Signed-off-by: root <root@hk760245497450.local>

* Add fast test for attention-only model forward pass

  Verifies that GraniteMoeHybrid models with all attention layers (no mamba layers) can run forward without crashing. Regression test for #45507.

  Signed-off-by: root <root@hk760245497450.local>

* fixup closer to hybrid attentions

---------

Signed-off-by: root <root@hk760245497450.local>
Co-authored-by: root <root@hk760245497450.local>
Co-authored-by: vasqu <antonprogamer@gmail.com>
Fixes #45507
Summary
GraniteMoeHybridModel._update_mamba_mask calls past_key_values.has_previous_state() without checking whether the model actually has mamba layers. When all layers are attention-only (no mamba layers in config.layers_block_type), has_previous_state() fails to find a LinearAttentionCacheLayerMixin layer and raises ValueError.

Fix

Check config.layers_block_type for mamba layers before calling has_previous_state(). If no mamba layers exist, return the attention mask as-is since the mamba mask optimization is irrelevant.

Applied to both modeling_granitemoehybrid.py and modular_granitemoehybrid.py.
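As a quick sanity check, inspecting layers_block_type on a loaded config shows whether a checkpoint takes the attention-only path; the checkpoint path below is a placeholder, not a real model id:

```python
from transformers import AutoConfig

# Placeholder path; substitute a real GraniteMoeHybrid checkpoint.
config = AutoConfig.from_pretrained("path/to/granitemoehybrid-checkpoint")
print(config.layers_block_type)             # e.g. ["attention", "attention", ...]
print("mamba" in config.layers_block_type)  # False -> the mask is returned as-is
```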