Encoder-decoder model fails to compile after cache refactor #39746

### System Info

- `transformers` version: 4.55.0.dev0
- Platform: Linux-6.11.0-28-generic-x86_64-with-glibc2.35
- Python version: 3.11.13
- Huggingface_hub version: 0.34.2
- Safetensors version: 0.5.3
- Accelerate version: 1.8.1
- Accelerate config:    not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.9.0.dev20250714+cpu (NA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>


### Who can help?

@zucchini-nlp @ArthurZucker 

### Information

- [ ] The official example scripts
- [x] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)

### Reproduction

```python
import time
import requests
import torch
import PIL.Image
from transformers import pipeline

model_id = "nlpconnect/vit-gpt2-image-captioning"
image_to_text = pipeline("image-to-text", model=model_id, device="cpu", torch_dtype=torch.float16)
image_url = "https://ankur3107.github.io/assets/images/image-captioning-example.png"
image = PIL.Image.open(requests.get(image_url, stream=True, timeout=3000).raw)

# Warm up in eager mode before timing.
for _ in range(10):
    output = image_to_text(image)

# Time a single eager-mode call.
start = time.time()
output = image_to_text(image)
end = time.time()
print(f"eager mode pipeline latency {end - start}")

# Compile the model's forward pass.
image_to_text.model.forward = torch.compile(image_to_text.model.forward)

# Warm up again so compilation happens before timing.
for _ in range(10):
    output = image_to_text(image)

# Time a single compiled call.
start = time.time()
output = image_to_text(image)
end = time.time()
print(f"compile mode pipeline latency {end - start}")
```

Error log:
```
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., size=(1, 12, 1, 64), dtype=torch.float16), FakeTensor(..., size=(1, 12, 394, 64), dtype=torch.float16), FakeTensor(..., size=(1, 12, 394, 64), dtype=torch.float16)), **{'attn_mask': FakeTensor(..., size=(1, 1, 1, 197), dtype=torch.float16), 'dropout_p': 0.0, 'scale': None, 'is_causal': False}): got RuntimeError('Attempting to broadcast a dimension of length 197 at -1! Mismatching argument at index 1 had torch.Size([1, 1, 1, 197]); but expected shape should be broadcastable to [1, 12, 1, 394]')
```
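
To make the shape mismatch in the error log above easier to read, here is a minimal sketch in plain Python (no `torch` needed) that applies the usual trailing-dimension broadcasting rule to the shapes copied from the log. The helper `broadcastable_to` is a hypothetical name introduced only for this illustration; it is not part of `torch` or `transformers`. The sketch only restates what the log reports and does not establish the root cause.

```python
def broadcastable_to(shape, target):
    """Return True if `shape` can be broadcast to `target` (NumPy/PyTorch-style rule)."""
    if len(shape) > len(target):
        return False
    # Compare trailing dimensions: each source dim must be 1 or equal the target dim.
    for s, t in zip(reversed(shape), reversed(target)):
        if s != 1 and s != t:
            return False
    return True


# Shapes copied from the error log.
query_shape = (1, 12, 1, 64)
key_shape = (1, 12, 394, 64)
mask_shape = (1, 1, 1, 197)

# Attention scores have shape (batch, heads, query_len, key_len).
scores_shape = query_shape[:-1] + (key_shape[-2],)

print(scores_shape)                                # (1, 12, 1, 394)
print(broadcastable_to(mask_shape, scores_shape))  # False
print(key_shape[-2] / mask_shape[-1])              # 2.0
```

The key/value length seen under compilation (394) is exactly twice the attention mask length (197). The log does not say why; it only shows that the mask and the key/value states disagree on sequence length.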

### Expected behavior

Before PR [#38635](https://github.com/huggingface/transformers/pull/38635), the script ran without errors and got a 1.5x speed-up from `torch.compile`.
