System Info
Two versions of transformers:
========= NEW VERSION ==============
- transformers version: 4.46.1
- Platform: Linux-5.15.0-1044-nvidia-x86_64-with-glibc2.35
- Python version: 3.11.10
- Huggingface_hub version: 0.23.3
- Safetensors version: 0.4.3
- Accelerate version: 0.32.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.3.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA H100 80GB HBM3
=========== OLD VERSION =====================
- transformers version: 4.34.1
- Platform: Linux-5.15.0-1044-nvidia-x86_64-with-glibc2.35
- Python version: 3.11.10
- Huggingface_hub version: 0.17.3
- Safetensors version: 0.4.3
- Accelerate version: 0.20.3
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
@ArthurZucker
Information
Tasks
Reproduction
Pull request #29285 aimed to make the RoPE sin and cos calculations run in float32. But it turns out that changing the device from CPU to CUDA also produces different results, though the difference is not large.
To check this, you can run the following code.
import torch

vals = torch.linspace(0, 1, 30000, dtype=torch.float32)
computes = {
    "cpu_32": vals.float().cpu().cos(),
    "cuda_32": vals.float().cuda().cos(),
    "cpu_16": vals.half().cpu().cos(),
    "cuda_16": vals.half().cuda().cos(),
}

def compare(x, y):
    # Maximum absolute elementwise difference, measured on both devices so the
    # .to() transfer itself cannot mask a discrepancy.
    return max(
        torch.max(torch.abs(x.to(y.device) - y)),
        torch.max(torch.abs(x - y.to(x.device))),
    ).item()

keys = computes.keys()
print(end="\t")
for k in keys:
    print(k, end="\t\t")
print()
for k1 in keys:
    print(k1, end="\t")
    for k2 in keys:
        print(f"{compare(computes[k1], computes[k2]):1.3e}", end="\t")
    print()
The output:
         cpu_32     cuda_32    cpu_16     cuda_16
cpu_32   0.000e+00  5.960e-08  4.389e-04  4.389e-04
cuda_32  5.960e-08  0.000e+00  4.389e-04  4.389e-04
cpu_16   4.389e-04  4.389e-04  0.000e+00  0.000e+00
cuda_16  4.389e-04  4.389e-04  0.000e+00  0.000e+00
This table shows the maximum absolute difference between computations on different devices and with different data types. All float16 results are identical, but the float32 results differ between CPU and CUDA.
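As a quick sanity check (a hypothetical extension of the snippet above, reusing its computes dict and compare helper), you can round both float32 results to float16: since the ~6e-08 CPU/CUDA gap is far below float16 resolution near 1.0, it should collapse to zero at almost every position.

# Hypothetical follow-up: round the float32 results to float16 and compare.
# Expected: 0.000e+00 (or at most one float16 ULP at a rounding boundary).
print(f"{compare(computes['cpu_32'].half(), computes['cuda_32'].half()):1.3e}")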
Previously, all sin and cos computations were performed on the CPU. To maintain backward compatibility, I propose running the float32 computations on the CPU as well.
Here, at lines 142–144 of modeling_llama.py:
https://github.com/unslothai/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L142

emb = torch.cat((freqs, freqs), dim=-1)
cos = emb.cos()
sin = emb.sin()

change to:

emb = torch.cat((freqs, freqs), dim=-1).cpu()
cos = emb.cos().to(device_type)
sin = emb.sin().to(device_type)
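For context, these lines sit inside an autocast-disabled block of LlamaRotaryEmbedding.forward in recent releases, where device_type is already in scope as the device type string of the activations (e.g. "cuda"). A minimal standalone sketch of the surrounding computation with the proposed CPU fallback (simplified: dynamic-RoPE updates and attention scaling are omitted, and the names follow the 4.46 implementation):

import torch

def rope_cos_sin(inv_freq: torch.Tensor, position_ids: torch.Tensor, x: torch.Tensor):
    # Sketch of the core of LlamaRotaryEmbedding.forward with the proposed
    # CPU fallback; not the exact library code.
    inv_freq_expanded = inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
    position_ids_expanded = position_ids[:, None, :].float()
    device_type = x.device.type
    with torch.autocast(device_type=device_type, enabled=False):
        freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
        # Proposed change: take cos/sin on the CPU for bit-compatibility with
        # older releases, then move the results back to the original device.
        emb = torch.cat((freqs, freqs), dim=-1).cpu()
        cos = emb.cos().to(device_type)
        sin = emb.sin().to(device_type)
    return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)

The cost is one extra device round-trip; since the sin/cos table is computed once and shared across layers in recent versions, the overhead should be small relative to the attention computation itself.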
Impact
According to my experiments, this difference in the sin and cos embedding calculation propagates to the output logits and the generated tokens. The difference between output logit values can exceed 10, and more than 0.1% of the generated tokens can change compared to the original computation.
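A sketch of how such a measurement could be made (a hypothetical helper, not the script I used; logits_old and logits_new would come from two otherwise-identical runs, one per sin/cos device):

import torch

def logit_and_token_drift(logits_old: torch.Tensor, logits_new: torch.Tensor):
    # Both tensors have shape (seq_len, vocab_size) and come from runs that
    # differ only in the device used for the RoPE sin/cos computation.
    max_logit_diff = (logits_old - logits_new).abs().max().item()
    changed_frac = (logits_old.argmax(-1) != logits_new.argmax(-1)).float().mean().item()
    return max_logit_diff, changed_frac  # e.g. (>10, >0.001) in my runs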
Expected behavior
RoPE sin and cos values are expected to match those produced by previous versions of transformers.