Fix executorch export with dynamic shapes #41559
justinchuby wants to merge 12 commits into huggingface:main
Conversation
Add a shape compatibility check for the attention mask to ensure torch.export can reason about the logic without failing.
Is fx tracing still supported? It doesn’t seem to be compatible with torch._check here?
Hey @justinchuby! Sorry but I don't get why this is needed.
@Cyrilvallez thanks for the suggestion - I can test with the executorch integration. For context, we are looking to enable onnx export via torch.export. Could you share guidance on a potential "onnx" integration directory next to the current executorch directory, for correct usage of onnx export calls? I assume we can follow a similar model to executorch, where we have an integration and a separate optimum-onnx project. torch.export recently resolved the vmap issues, and torch._check was useful for the tracer's dynamic shapes engine to understand the equivalence. I agree that it should be part of the torch.export integration.
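For readers following along, roughly the flow being described looks like the sketch below (illustrative only: the checkpoint id is a placeholder, and whether a given model exports cleanly with dynamic shapes is exactly what the fixes discussed in this PR affect; `dynamo=True` needs a recent PyTorch).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any decoder-only model using sdpa attention would do.
model_id = "hf-internal-testing/tiny-random-LlamaForCausalLM"
model = AutoModelForCausalLM.from_pretrained(model_id).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello world", return_tensors="pt")

# Mark the sequence dimension as dynamic so the exported graph is not
# specialized to this particular prompt length.
seq = torch.export.Dim("sequence", max=512)
dynamic_shapes = {"input_ids": {1: seq}, "attention_mask": {1: seq}}

# torch.onnx.export with dynamo=True goes through torch.export under the hood,
# which is where the dynamic-shape reasoning discussed in this PR matters.
onnx_program = torch.onnx.export(
    model,
    args=(),
    kwargs={"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]},
    dynamic_shapes=dynamic_shapes,
    dynamo=True,
)
onnx_program.save("model.onnx")
```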
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
I updated the PR to fix the executorch integration. Please take another look. Thanks!
Ouuhhh that's very very nice, I wasn't aware of it! |
```python
def sdpa_attention_forward_for_export(
    module: torch.nn.Module,
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor],
    dropout: float = 0.0,
    scaling: Optional[float] = None,
    is_causal: Optional[bool] = None,
    **kwargs,
) -> tuple[torch.Tensor, None]:
    # This is the same as sdpa_attention_forward, but simplified and with torch._check added
    # for torch.export dynamic shapes support
    if kwargs.get("output_attentions", False):
        logger.warning_once(
            "`sdpa` attention does not support `output_attentions=True`."
            " Please set your attention to `eager` if you want any of these features."
        )
    sdpa_kwargs = {}
    if hasattr(module, "num_key_value_groups"):
        # Always use enable_gqa for grouped query attention, which is supported by torch.export
        sdpa_kwargs = {"enable_gqa": True}

    if attention_mask is not None and attention_mask.ndim == 4:
        attention_mask = attention_mask[:, :, :, : key.shape[-2]]
        # torch._check is used to inform torch.export of the shape relationship
        torch._check(attention_mask.shape[-1] == key.shape[-2])

    # We dispatch to SDPA's Flash Attention or Efficient kernels via this `is_causal` if statement instead of an inline conditional assignment
    # in SDPA to support both torch.compile's dynamic shapes and full graph options. An inline conditional prevents dynamic shapes from compiling.
    # Note that it is important to check first for the shape, otherwise compile will fail with `argument 'is_causal' must be bool, not SymBool`
    if is_causal is None:
        # The last condition is for encoder (decoder) models which specify this by passing their own `is_causal` flag
        # This is mainly due to those models having mixed implementations for encoder, decoder, and encoder-decoder attns
        is_causal = query.shape[2] > 1 and attention_mask is None and getattr(module, "is_causal", True)

    attn_output = torch.nn.functional.scaled_dot_product_attention(
        query,
        key,
        value,
        attn_mask=attention_mask,
        dropout_p=dropout,
        scale=scaling,
        is_causal=is_causal,
        **sdpa_kwargs,
    )
    attn_output = attn_output.transpose(1, 2).contiguous()

    return attn_output, None
```
Can we instead simply do the torch._check and then call sdpa_attention_forward? Would be much simpler!
The check is needed because of this line: `attention_mask = attention_mask[:, :, :, : key.shape[-2]]`. Do you have a suggestion on how it can be avoided or moved out, maybe? The check needs to follow this slice operation. The reason is that when slicing, and when key.shape[-2] is dynamic, torch.export wouldn't know whether it is actually getting the exact full slice.
I also simplified the gqa logic to always use enable_gqa instead of repeat_interleave because the option is now supported.
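To make the pattern concrete, here is a stripped-down, self-contained sketch (toy module and shapes, not the actual transformers code): the mask arrives with its own kv-length dimension, gets sliced down to the key length, and the torch._check tells export that the two lengths coincide so downstream shape checks can be resolved. This may or may not reproduce the exact guard failure depending on the torch version, but it shows the shape relationship being asserted.

```python
import torch


class ToyMaskedAttention(torch.nn.Module):
    def forward(self, query, key, value, attention_mask):
        # Same pattern as in the diff above: the mask comes in with its own
        # kv-length dimension and is sliced down to the key length.
        attention_mask = attention_mask[:, :, :, : key.shape[-2]]
        # Without this, export only knows the sliced length is at most the key length;
        # torch._check records that they are in fact equal.
        torch._check(attention_mask.shape[-1] == key.shape[-2])
        return torch.nn.functional.scaled_dot_product_attention(
            query, key, value, attn_mask=attention_mask
        )


batch, heads, seq, head_dim = 1, 4, 8, 16
example_inputs = (
    torch.randn(batch, heads, seq, head_dim),          # query
    torch.randn(batch, heads, seq, head_dim),          # key
    torch.randn(batch, heads, seq, head_dim),          # value
    torch.ones(batch, 1, seq, seq, dtype=torch.bool),  # 4D attention mask
)
seq_dim = torch.export.Dim("seq", max=128)
kv_dim = torch.export.Dim("kv", max=128)
dynamic_shapes = ({2: seq_dim}, {2: seq_dim}, {2: seq_dim}, {2: seq_dim, 3: kv_dim})
exported = torch.export.export(ToyMaskedAttention(), example_inputs, dynamic_shapes=dynamic_shapes)
print(exported)
```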
Ohhh, I see - actually this slicing op should not be here at all, i.e. the mask should be correctly prepared upstream, which is the case for all recent models using the nice mask primitives.
I've been meaning to check at some point whether we still have older models for which it's still necessary, here and in the eager_attention_forward attentions.
And if you start by slicing the mask, then _check, and then call sdpa_attention_forward, would it work? It would get re-sliced in the call, but to the exact same length, not sure if the _check would be lost then?
> And if you start by slicing the mask, then _check, and then call sdpa_attention_forward, would it work?
I can check that - but it is going to create a duplicated slice op which is not ideal. I also want to make sure enable_gqa is used. Maybe for that we can update sdpa_attention_forward to always use enable_gqa instead?
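For reference, a quick self-contained check (toy shapes; `enable_gqa` requires a sufficiently recent PyTorch) that SDPA's native GQA path matches the manual KV-head expansion that repeat_interleave performs:

```python
import torch
import torch.nn.functional as F

batch, q_heads, kv_heads, seq, head_dim = 2, 8, 2, 16, 64
query = torch.randn(batch, q_heads, seq, head_dim)
key = torch.randn(batch, kv_heads, seq, head_dim)
value = torch.randn(batch, kv_heads, seq, head_dim)

# Manual GQA: expand the KV heads to match the query heads before calling SDPA
# (what the repeat_interleave / repeat_kv path does today).
n_rep = q_heads // kv_heads
out_manual = F.scaled_dot_product_attention(
    query,
    key.repeat_interleave(n_rep, dim=1),
    value.repeat_interleave(n_rep, dim=1),
)

# Native GQA: let SDPA broadcast the KV heads itself.
out_native = F.scaled_dot_product_attention(query, key, value, enable_gqa=True)

# The two should agree up to kernel numerics.
torch.testing.assert_close(out_manual, out_native)
```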
If we can get slicing out from the forward function, and ensure enable_gqa when exporting, then this patch can be simplified to _check then call sdpa_attention_forward
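Something like the following is presumably what's meant (a minimal sketch: it assumes the mask already has the right kv length upstream, that sdpa_attention_forward takes care of enable_gqa, and that the import path matches current transformers):

```python
import torch

from transformers.integrations.sdpa_attention import sdpa_attention_forward


def sdpa_attention_forward_for_export(module, query, key, value, attention_mask, **kwargs):
    # Sketch of the suggested simplification: only record the mask/key shape
    # relationship for torch.export, then delegate to the regular SDPA forward.
    if attention_mask is not None and attention_mask.ndim == 4:
        torch._check(attention_mask.shape[-1] == key.shape[-2])
    return sdpa_attention_forward(module, query, key, value, attention_mask, **kwargs)
```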
What condition would you suggest I change to? Thanks
I don't have a good suggestion tbh, is there a way to tell we are tracing with export only? The issue is that we really can't expect people to only use export and falling back to the math kernel is likely more expensive than the manual repeats.
Maybe I can instead remove the slice and see what breaks?
Responding here but saw the other PR. I think that's the right way but let's wait on @Cyrilvallez to come back (next week).
Yes, we can use torch.compiler.is_exporting() to check that.
Then let's add a condition on the gqa function (i.e. here) to allow GQA in any case when we detect exporting. Is it limited to a specific version of export? Might need to double-check the torch version.
Side note, semi-relevant: the masks are refactored a bit in #41852, so we won't need a workaround for export in the future.
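Roughly something along these lines, presumably (sketch only: the real `use_gqa_in_sdpa` helper has additional eligibility checks that are elided here and its signature may differ; the `hasattr` guard covers older torch versions that don't have `torch.compiler.is_exporting()`):

```python
import torch


def use_gqa_in_sdpa(attention_mask, key) -> bool:
    # Sketch: when torch.export is tracing, always let SDPA handle grouped-query
    # attention natively (enable_gqa=True) instead of repeating the KV heads.
    if hasattr(torch.compiler, "is_exporting") and torch.compiler.is_exporting():
        return True
    # ... existing eligibility checks (torch version, mask type, etc.) would go here ...
    return False
```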
Great! I can create a separate update for use_gqa_in_sdpa. The most important fix has been removing the slice on masks.
BTW, we will probably add a non-vmap path to the sdpa mask creation due to vmap being much slower, and the impact being noticeable for small models (#41639). This PR is still welcome though, as its focus is on the sdpa attention, not sdpa mask, but I thought I'd mention it just in case!
Concerning this, the answer is no: #41683!
I've tried this PR and found it didn't solve the performance regression in #41639. Will it be fixed in the next PR?
I believe this PR is unrelated to the issue you linked |
I think we can merge #41900, then I will create a separate PR to guard torch.export on dynamic shapes.
What does this PR do?
This PR adds a shape compatibility check for the attention mask to ensure torch.export can reason about the logic without failing when exporting with dynamic shapes.
It additionally simplifies the sdpa forward function for export-only usage.
Who can review?
@Cyrilvallez @jackzhxng @guangy10