
Fix nativ tok #42874

Closed
itazap wants to merge 11 commits into main from fix_nativ_tok

Conversation

@itazap
Collaborator

@itazap itazap commented Dec 15, 2025

fixes #42796

Fixes incorrect tokenization for Ministral-3 models by preserving and restoring the ByteLevel decoder from tokenizer.json when loading tokenizers. Previously, custom init methods were overwriting the ByteLevel decoder with a Metaspace decoder.
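A minimal sketch of the symptom, using plain Python rather than the actual transformers code (the decode helpers below are illustrative, not real library functions): byte-level BPE tokens mark a leading space with "Ġ" (U+0120), while Metaspace-style tokens use "▁", so decoding byte-level tokens with Metaspace logic leaves stray "Ġ" characters in the output.

```python
# Byte-level BPE tokens for the text "Hello world"; "Ġworld" carries the
# leading-space marker used by ByteLevel tokenizers.
tokens = ["Hello", "Ġworld"]

def byte_level_decode(tokens):
    # A ByteLevel decoder maps the "Ġ" marker back to a space.
    # (Simplified: the real decoder does a full byte-to-char remapping.)
    return "".join(tokens).replace("Ġ", " ")

def metaspace_decode(tokens):
    # A Metaspace decoder maps "▁" to a space; the "Ġ" marker is left as-is.
    return "".join(tokens).replace("▁", " ")

print(byte_level_decode(tokens))  # Hello world
print(metaspace_decode(tokens))   # HelloĠworld  (the garbled output from the bug)
```

With the wrong decoder installed, every word boundary surfaces as a literal "Ġ" in decoded text, which matches the issue report below.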

This happens because these models reference a LlamaTokenizerFast in their tokenizer_config.json on the Hub, and LlamaTokenizer's init overrides tokenizer components.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: apertus, auto, bart, bigbird_pegasus, blenderbot, blenderbot_small, blip, blt, bridgetower, chameleon, clipseg, decision_transformer, dia, evolla, flaubert, gemma3n

@itazap itazap requested a review from ArthurZucker December 16, 2025 16:21
@jurgisp

jurgisp commented Dec 31, 2025

Thanks @itazap, when do you think this could be merged?

@itazap itazap mentioned this pull request Jan 5, 2026
@itazap
Collaborator Author

itazap commented Jan 5, 2026

Hello! Sorry, actually this requires #42894 as the fix! It will be merged in the next couple of days.

@itazap itazap closed this Jan 7, 2026


Development

Successfully merging this pull request may close these issues.

Ministral-3-8B-Instruct tokenizer doesn't handle BPE markers properly
