Deepseek tokenizer produces incorrect results as of v5 (works in v4) #44779

@xenova

Description

System Info

  • transformers version: 5.3.0
  • Platform: Linux-6.6.113+-x86_64-with-glibc2.35
  • Python version: 3.12.12
  • Huggingface_hub version: 1.6.0
  • Safetensors version: 0.7.0
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.10.0+cpu (NA)
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-R1')

text = "How are you doing?"
print(tokenizer.encode(text))
print(tokenizer.tokenize(text))
print(tokenizer.decode(tokenizer.encode(text)))

produces

[4117, 591, 12829, 62552, 33]
['How', 'are', 'you', 'doing', '?']
Howareyoudoing?

Expected behavior

After downgrading to transformers==4.57.6, the same code produces

[0, 4117, 477, 440, 4843, 33]
['How', 'Ġare', 'Ġyou', 'Ġdoing', '?']
<|begin▁of▁sentence|>How are you doing?
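Note that the v5 output is missing both the BOS token and the `Ġ` prefixes on the tokens. In byte-level BPE tokenizers (the GPT-2 scheme DeepSeek's tokenizer is based on), a leading space is stored inside the token as `Ġ` (U+0120) via a reversible byte-to-unicode mapping, and decoding maps it back to a real space. So tokens like `'are'` instead of `'Ġare'` explain why the round-tripped text loses its spaces. A minimal, self-contained sketch of that mapping (independent of transformers):

```python
def bytes_to_unicode():
    # GPT-2's reversible byte -> unicode mapping: printable byte values
    # stay as themselves; the rest (including space, 0x20) are shifted
    # up by 256 so every byte has a visible, unambiguous character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte2uni = bytes_to_unicode()
uni2byte = {v: k for k, v in byte2uni.items()}

# Encoding: the leading space of " are" becomes U+0120 ("Ġ").
token = "".join(byte2uni[b] for b in " are".encode("utf-8"))
print(token)  # Ġare

# Decoding: mapping back through the table recovers the space.
decoded = bytes(uni2byte[c] for c in token).decode("utf-8")
print(repr(decoded))  # ' are'
```

If the v5 tokenizer emits `'are'` rather than `'Ġare'`, this reverse step has nothing to restore, which matches the `Howareyoudoing?` decode above.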
