Fix Seq2SeqLM ExecuTorch export: add encoder_attention_mask to decoder and use static encoder shapes#45523
Open
duyhv-qualgo wants to merge 2 commits into huggingface:main from
Conversation
…r and use static encoder shapes

Two related bugs in the seq2seq ExecuTorch export path:

1. `Seq2SeqLMDecoderExportableModuleWithStaticCache.forward` did not pass `encoder_attention_mask` to the decoder stack. For T5 (and any model using relative position bias scaled by `key_length`), omitting this mask causes the bias to be computed over the full padded sequence length rather than the real encoder length, producing ~20× logit scale errors and wrong greedy-decoding outputs.
2. `Seq2SeqLMExportableModule._export_decoder` marked `encoder_hidden_states` dim-1 as dynamic (`encoder_hidden_seq_length`). With transformers 5.0 the static KV-cache size is a compile-time constant; a symbolic encoder dim creates a shape conflict during `torch.export` for models like T5 that slice the cross-attention causal mask against the cache size.

Fix:

- Add optional `encoder_attention_mask` parameter to `Seq2SeqLMDecoderExportableModuleWithStaticCache.forward` and thread it through to `self.decoder(...)`.
- Remove the dynamic encoder dim in `_export_decoder`; callers are expected to pad encoder inputs to `max_cache_len` (the static export shape), as sketched below.
- Update `Seq2SeqLMExportableModule.export()` and `generate()` to build and pass the encoder attention mask automatically.
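To make the padding expectation in the second fix bullet concrete, here is a minimal sketch of what a caller might do; it is not code from this PR, and the helper name, tensor layouts, and padding values are assumptions:

```python
import torch
import torch.nn.functional as F

def pad_encoder_inputs(encoder_hidden_states, attention_mask, max_cache_len):
    """Right-pad encoder outputs to the static export length and extend the mask.

    Sketch only: assumes batch-first tensors of shape [batch, seq, hidden] and
    [batch, seq]; the real export path may organize this differently.
    """
    pad_len = max_cache_len - encoder_hidden_states.shape[1]
    if pad_len < 0:
        raise ValueError("encoder sequence is longer than max_cache_len")
    # Zero-pad hidden states along the sequence dimension.
    padded_states = F.pad(encoder_hidden_states, (0, 0, 0, pad_len))
    # Padded positions are masked out (0); real positions stay 1.
    padded_mask = F.pad(attention_mask, (0, pad_len), value=0)
    return padded_states, padded_mask
```

With the mask carried alongside the padded states, T5's relative position bias is computed over the real encoder length rather than over `max_cache_len`.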
Force-pushed from ad2e86c to 373f55c
Contributor
@Cyrilvallez could you check it out
Problem
Two related bugs in `src/transformers/integrations/executorch.py` that break seq2seq (T5) ExecuTorch export.

Bug 1 — `encoder_attention_mask` not forwarded to decoder

`Seq2SeqLMDecoderExportableModuleWithStaticCache.forward` calls `self.decoder(...)` without passing `encoder_attention_mask`. For T5, the cross-attention module computes a relative position bias scaled by `key_length`. When no mask is provided, `key_length` equals the full padded sequence length (e.g. 512) instead of the real encoder output length. This causes a ~20× logit scale error and completely wrong greedy-decoding outputs.

Verified: exporting and running T5 without this fix produces semantically wrong translations. With the fix, ExecuTorch output matches HuggingFace `model.generate()` exactly (5/5 test cases, exact string match).
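In rough terms, the fix threads the mask through the wrapper's `forward`. A simplified sketch follows; only the names mentioned in this PR are taken from it, and the surrounding structure is assumed:

```python
from __future__ import annotations

import torch
from torch import Tensor

class DecoderWrapperSketch(torch.nn.Module):
    """Simplified stand-in for Seq2SeqLMDecoderExportableModuleWithStaticCache."""

    def __init__(self, decoder: torch.nn.Module, lm_head: torch.nn.Module):
        super().__init__()
        self.decoder = decoder
        self.lm_head = lm_head

    def forward(
        self,
        decoder_input_ids: Tensor,
        encoder_hidden_states: Tensor,
        cache_position: Tensor,
        encoder_attention_mask: Tensor | None = None,  # new optional argument
    ) -> Tensor:
        # Forwarding the mask lets T5 compute its relative position bias over
        # the real encoder length instead of the full padded length.
        outputs = self.decoder(
            input_ids=decoder_input_ids,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            cache_position=cache_position,
            use_cache=True,
        )
        return self.lm_head(outputs[0])
```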
Bug 2 — Dynamic encoder dim conflicts with static KV cache

`Seq2SeqLMExportableModule._export_decoder` marks `encoder_hidden_states` dim-1 as a dynamic symbol (`encoder_hidden_seq_length`). With transformers 5.0, T5's cross-attention causal mask slices against the static KV-cache size; this raises `RuntimeError: tensor a (1024) must match tensor b (s96)` during `torch.export`.

Fix: remove the dynamic encoder dim; callers pad encoder inputs to `max_cache_len` (the static export shape), which is the correct assumption for static-shape ExecuTorch deployment.

Changes
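To illustrate the shape change at the `torch.export` call, here is a self-contained toy sketch; the module, hidden size, and example shapes are placeholders and do not come from the real wrapper:

```python
import torch
from torch.export import export

class TinyDecoder(torch.nn.Module):
    """Placeholder standing in for the exported decoder wrapper."""

    def forward(self, decoder_input_ids, encoder_hidden_states, cache_position):
        # Toy, shape-dependent computation, just to have something to export.
        return (
            encoder_hidden_states.sum(dim=1)
            + decoder_input_ids.float().sum()
            + cache_position.float().sum()
        )

max_cache_len = 1024
example_inputs = (
    torch.zeros(1, 1, dtype=torch.long),  # decoder_input_ids
    torch.zeros(1, max_cache_len, 16),    # encoder_hidden_states, pre-padded
    torch.zeros(1, dtype=torch.long),     # cache_position
)

# Before (problematic): dim 1 of encoder_hidden_states was marked dynamic,
# e.g. {1: torch.export.Dim("encoder_hidden_seq_length")}, which clashes with
# the compile-time-constant static cache size.
# After: export with fully static shapes; callers pad to max_cache_len.
exported = export(TinyDecoder(), example_inputs, dynamic_shapes=None)
print(exported)
```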
- `Seq2SeqLMDecoderExportableModuleWithStaticCache.forward`: add `encoder_attention_mask: Tensor | None = None` parameter and pass it to `self.decoder(...)`.
- `Seq2SeqLMExportableModule._export_decoder`: accept and pass `encoder_attention_mask`; remove the dynamic encoder sequence length shape (use `dynamic_shapes=None`).
- `Seq2SeqLMExportableModule.export()`: pass `encoder_attention_mask` (default: all-ones of shape `[batch, max_cache_len]`).
- `Seq2SeqLMExportableModule.generate()`: build `encoder_attention_mask` from `prompt_token_ids != 0` and pass it to each decoder step (see the decoding sketch after this list).
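As referenced in the last bullet, here is a rough sketch of the mask construction and its per-step use during greedy decoding; the `decode_step` callable, pad id 0, and loop details are assumptions rather than the PR's actual implementation:

```python
import torch

def greedy_decode(decode_step, prompt_token_ids, encoder_hidden_states,
                  decoder_start_token_id, eos_token_id, max_new_tokens):
    """Greedy loop that passes the encoder mask at every decoder step.

    `decode_step(input_ids, encoder_hidden_states, cache_position,
    encoder_attention_mask)` is assumed to return logits of shape
    [batch, 1, vocab].
    """
    # Pad token id 0 is assumed here, matching the PR description.
    encoder_attention_mask = (prompt_token_ids != 0).long()
    generated = [decoder_start_token_id]
    for step in range(max_new_tokens):
        input_ids = torch.tensor([[generated[-1]]], dtype=torch.long)
        cache_position = torch.tensor([step], dtype=torch.long)
        logits = decode_step(input_ids, encoder_hidden_states,
                             cache_position, encoder_attention_mask)
        next_token = int(logits[0, -1].argmax())
        generated.append(next_token)
        if next_token == eos_token_id:
            break
    return generated
```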
Test plan

- ExecuTorch output matches HuggingFace `model.generate()` on 5 translation test cases after this fix
- `tests/models/test_modeling_t5.py` `@slow` tests (if CI can run them)
- `tests/integrations/test_executorch.py` — no seq2seq-specific tests exist yet; a follow-up can add one (a rough sketch follows)
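A follow-up test could compare exported and eager outputs along these lines; `run_exported_generate` is a hypothetical helper, not an existing API, and the rest of the setup is assumed:

```python
def test_executorch_t5_matches_generate(model, tokenizer, run_exported_generate, prompts):
    # Hypothetical check: exported greedy decoding should string-match eager generate().
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        eager_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
        eager_text = tokenizer.decode(eager_ids[0], skip_special_tokens=True)
        exported_text = run_exported_generate(prompt)  # hypothetical helper
        assert exported_text == eager_text
```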
Who can review?

cc @vasqu @Cyrilvallez