Deepseek tokenizer produces incorrect results as of v5 (works in v4) #44779

@xenova

Description

System Info

  • transformers version: 5.3.0
  • Platform: Linux-6.6.113+-x86_64-with-glibc2.35
  • Python version: 3.12.12
  • Huggingface_hub version: 1.6.0
  • Safetensors version: 0.7.0
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.10.0+cpu (NA)
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-R1')

text = "How are you doing?"
print(tokenizer.encode(text))
print(tokenizer.tokenize(text))
print(tokenizer.decode(tokenizer.encode(text)))

produces

[4117, 591, 12829, 62552, 33]
['How', 'are', 'you', 'doing', '?']
Howareyoudoing?

Expected behavior

After downgrading to transformers==4.57.6, the same code produces

[0, 4117, 477, 440, 4843, 33]
['How', 'Ġare', 'Ġyou', 'Ġdoing', '?']
<|begin▁of▁sentence|>How are you doing?
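Note that the v5 output is missing both the BOS token and the `Ġ` prefixes on the tokens. In byte-level BPE tokenizers (the GPT-2 scheme DeepSeek's tokenizer is based on), a leading space is stored inside the token as `Ġ` (U+0120) via a reversible byte-to-unicode mapping, and decoding maps it back to a real space. So tokens like `'are'` instead of `'Ġare'` explain why the round-tripped text loses its spaces. A minimal, self-contained sketch of that mapping (independent of transformers):

```python
def bytes_to_unicode():
    # GPT-2's reversible byte -> unicode mapping: printable byte values
    # stay as themselves; the rest (including space, 0x20) are shifted
    # up by 256 so every byte has a visible, unambiguous character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte2uni = bytes_to_unicode()
uni2byte = {v: k for k, v in byte2uni.items()}

# Encoding: the leading space of " are" becomes U+0120 ("Ġ").
token = "".join(byte2uni[b] for b in " are".encode("utf-8"))
print(token)  # Ġare

# Decoding: mapping back through the table recovers the space.
decoded = bytes(uni2byte[c] for c in token).decode("utf-8")
print(repr(decoded))  # ' are'
```

If the v5 tokenizer emits `'are'` rather than `'Ġare'`, this reverse step has nothing to restore, which matches the `Howareyoudoing?` decode above.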
