[llama4] Inference with multiple GPUs: torch.distributed.DistStoreError #37355

@GilgameshD

Description

System Info

  • transformers version: 4.52.0.dev0
  • Platform: Linux-5.15.0-1032-oracle-x86_64-with-glibc2.31
  • Python version: 3.10.16
  • Huggingface_hub version: 0.30.1
  • Safetensors version: 0.5.2
  • Accelerate version: 1.3.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (GPU?): 2.6.0+cu124 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: Yes
  • Using GPU in script?: Yes
  • GPU type: NVIDIA A100-SXM4-80GB x 8

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am running the official Llama 4 example with the latest transformers code:

from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch


model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {
                "type": "text",
                "text": "Can you describe how these two images are similar, and how they differ?",
            },
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1] :])[0]
print(response)
print(outputs[0])

I got the following error:

torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 1/8 clients joined. 
OSError: We tried to initialize torch.distributed for you, but it failed, makesure you init torch distributed in your script to use `tp_plan='auto'`
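For reference, here is a minimal sketch of what the error message asks for: initializing torch.distributed manually before loading the model. This is an assumption on my part, not a confirmed fix; it supposes the script is launched with torchrun across all 8 GPUs and uses the `tp_plan="auto"` path that the error refers to, whereas the example above only passes `device_map="auto"`.

import os

import torch
import torch.distributed as dist
from transformers import Llama4ForConditionalGeneration

# Sketch only: launch with `torchrun --nproc_per_node=8 repro.py` so that
# RANK / WORLD_SIZE / LOCAL_RANK / MASTER_ADDR / MASTER_PORT are set.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)        # bind this process to one GPU
dist.init_process_group(backend="nccl")  # env:// rendezvous from torchrun vars

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    tp_plan="auto",                      # tensor parallelism instead of device_map
    torch_dtype=torch.bfloat16,
)

Why the `device_map="auto"` call in the original example triggers this distributed initialization at all is the open question here.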

Expected behavior

Text output from the model
