
Fix Kimi-K2.5 tokenizer regression and _patch_mistral_regex AttributeError#45359

Merged
ArthurZucker merged 1 commit into main from fix/kimi-k25-tokenizer-regression
Apr 13, 2026

Conversation


@ArthurZucker (Collaborator) commented Apr 10, 2026

Fixes #45356

Summary

  • Remove `kimi_k25` from `MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS`: its remote `TikTokenTokenizer` is the only correct backend. The model ships no `tokenizer.json`, and its `added_tokens_decoder` has non-sequential IDs (gaps, plus `[UNK]`/`[PAD]` at 163838/163839) that `TokenizersBackend.add_tokens()` cannot reproduce, so every token after ID 163588 was assigned a wrong ID.
  • Fix `_patch_mistral_regex`: use `tokenizer.pre_tokenizer` instead of `tokenizer.backend_tokenizer.pre_tokenizer`, since the method receives the raw `tokenizers.Tokenizer`, which has no `.backend_tokenizer` attribute.
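The first bullet's ID mismatch can be illustrated with a minimal standalone sketch. The IDs 163838/163839 and the `</think>` token come from this PR; the base-vocab boundary (163589) and the idea that re-adding always appends at the next free slot are assumptions for illustration, not the actual `TokenizersBackend` code:

```python
# IDs the Hub tokenizer actually uses: gapped, non-sequential.
hub_decoder = {
    163607: "</think>",
    163838: "[UNK]",   # note the gap before these two
    163839: "[PAD]",
}

# Hypothetical first free ID after the base vocab (per the PR,
# everything after 163588 came out wrong).
next_free_id = 163589

# A sequential add_tokens()-style rebuild can only append at the
# next free slot, so the gaps collapse:
rebuilt = {next_free_id + i: tok for i, tok in enumerate(hub_decoder.values())}

# rebuilt maps </think> to 163589 instead of 163607, and the
# gapped [UNK]/[PAD] IDs are unreproducible.
print(rebuilt == hub_decoder)  # False
```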

Test plan

  • `AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True).decode([163607])` returns `'</think>'`
  • Roundtrip encode/decode of `<think>hello</think>` works
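The `_patch_mistral_regex` fix comes down to which object shape the method receives: a fast-tokenizer wrapper exposes the raw tokenizer via `.backend_tokenizer`, while the raw `tokenizers.Tokenizer` exposes `.pre_tokenizer` directly. A minimal sketch with hypothetical stand-in classes (neither class is from transformers; the defensive `getattr` unwrap is one way to accept both shapes):

```python
class RawTokenizer:
    """Stand-in for a raw tokenizers.Tokenizer: has .pre_tokenizer directly."""
    pre_tokenizer = "regex-pre-tokenizer"

class FastWrapper:
    """Stand-in for an HF fast tokenizer: wraps the raw object."""
    backend_tokenizer = RawTokenizer()

def get_pre_tokenizer(tok):
    # Unwrap if tok is a wrapper; otherwise use the object itself.
    backend = getattr(tok, "backend_tokenizer", tok)
    return backend.pre_tokenizer

# Works on both shapes; accessing .backend_tokenizer unconditionally
# would raise AttributeError on the raw object, as in the reported bug.
print(get_pre_tokenizer(RawTokenizer()))
print(get_pre_tokenizer(FastWrapper()))
```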

…Error

Fixes #45356

Remove `kimi_k25` from `MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS` — its
remote `TikTokenTokenizer` is the only correct backend (no `tokenizer.json`,
non-sequential added-token IDs that `TokenizersBackend` cannot reproduce).

Also fix `_patch_mistral_regex`: the method receives the raw
`tokenizers.Tokenizer` object, which has `.pre_tokenizer` directly,
not `.backend_tokenizer.pre_tokenizer`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker requested a review from itazap April 10, 2026 12:26
@ArthurZucker ArthurZucker added the for patch Tag issues / labels that should be included in the next patch label Apr 10, 2026
@itazap (Collaborator) left a comment:

thanks! the change in tokenization_utils_tokenizers.py is also covered here: #45317

@ArthurZucker ArthurZucker added this pull request to the merge queue Apr 13, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 13, 2026
@ArthurZucker ArthurZucker merged commit 282078b into main Apr 13, 2026
31 checks passed
@ArthurZucker ArthurZucker deleted the fix/kimi-k25-tokenizer-regression branch April 13, 2026 15:16
ArthurZucker added a commit that referenced this pull request Apr 13, 2026
…Error (#45359)
sirzechs66 pushed a commit to sirzechs66/transformers that referenced this pull request Apr 18, 2026
…Error (huggingface#45359)

Labels

for patch Tag issues / labels that should be included in the next patch

Development

Successfully merging this pull request may close these issues.

Regression in Kimi-K2.5 tokenizer from 5.3.0 to 5.4.0: incorrect codec handling and misleading fix_mistral_regex warning

3 participants