Fix re-compilations for cross attention cache #39788
zucchini-nlp merged 1 commit into huggingface:main
Conversation
[For maintainers] Suggested jobs to run (before merge) run-slow: autoformer, bert, bert_generation, big_bird, bigbird_pegasus, blip, bridgetower, camembert, data2vec, electra, ernie, fsmt, gpt_bigcode, imagegpt, kosmos2, led
manueldeprada left a comment
lgtm, sorry!! These changes got lost when cherry-picking back and forth between the `layer[i].keys` and `key_cache[i]` designs in the original PR 😭
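For context, here is a minimal sketch of the two access patterns being discussed, assuming a recent transformers version where `DynamicCache` stores per-layer objects and keeps `key_cache` only as a compatibility property (shapes and values are purely illustrative):

```python
import torch
from transformers import DynamicCache

cache = DynamicCache()
# Illustrative shapes: (batch, num_heads, seq_len, head_dim)
key = torch.randn(1, 4, 3, 8)
value = torch.randn(1, 4, 3, 8)
cache.update(key, value, layer_idx=0)

# Per-layer design: each layer object owns its own key/value tensors.
k_new = cache.layers[0].keys

# Legacy design: indexed list exposed as `key_cache`; accessing it goes through
# a compatibility property that emits a deprecation warning, which is what
# breaks fullgraph compilation when core model code still uses it.
k_legacy = cache.key_cache[0]
```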
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
No worries, that happens 😄 Let me see if I can add an encoder-decoder compile test easily in this PR or if we need to handle a lot of edge cases.
EDIT: oh, these aren't generative models / can't compile with fullgraph, and we don't have a graph-break test for those models yet. That's why it wasn't caught in CI.
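A rough sketch of what such a graph-break check could look like (not the actual CI test; the tiny checkpoint name is a placeholder, and `torch._dynamo.explain` is only used here to count breaks instead of asserting a full graph):

```python
import torch
from transformers import AutoModelForSeq2SeqLM

# Placeholder tiny checkpoint; any small encoder-decoder model would do.
model = AutoModelForSeq2SeqLM.from_pretrained("hf-internal-testing/tiny-random-t5")
model.eval()

input_ids = torch.tensor([[1, 2, 3, 4]])
decoder_input_ids = torch.tensor([[0]])

# These models can't be compiled with fullgraph=True end to end, so instead of
# requiring a single graph we count how many graph breaks the forward pass hits.
explanation = torch._dynamo.explain(model)(
    input_ids=input_ids, decoder_input_ids=decoder_input_ids
)
print(explanation.graph_break_count)
```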
fix recompilations for cross attn cache
What does this PR do?
Fixes #39774.
As per the title, when the legacy `cache.key_cache[layer_idx]` accessor is used, a warning is emitted and fullgraph compilation breaks. This PR makes sure no warnings are raised when using the models in the core library.
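A minimal sketch of the behaviour this is meant to guarantee, assuming the legacy accessor raises a standard Python deprecation warning (the tiny checkpoint name is a placeholder):

```python
import warnings
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "hf-internal-testing/tiny-random-bart"  # placeholder tiny model
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tok("hello world", return_tensors="pt")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model.generate(**inputs, max_new_tokens=5)

# With this PR, core modeling code no longer goes through the legacy
# `key_cache`/`value_cache` accessors, so no such deprecation warning
# should be recorded during generation with a cross-attention cache.
assert not any("key_cache" in str(w.message) for w in caught)
```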